INTRODUCTION

This  study  investigated  a  computer-aided  diagnosis  (CADx)  system  for  breast  cancer  by 
combining  the  following  three  data  sources:  mammogram  films,  radiologist-interpreted  BI-RADS 
descriptors,  and  proteomic  profiles  of  blood  sera. 

Although mammography is the modality of choice for early detection of breast cancer, it has a
low positive predictive value (PPV). As a result, only 15 to 34% of women with radiographically
suspicious, nonpalpable lesions are actually found to have a malignancy by histologic diagnosis after
biopsy. The excessive biopsy of benign lesions raises the cost of mammographic screening and
results in emotional and physical burden to the patients, as well as financial burden to society.

In addition to mammography, both BI-RADS descriptors and clinical proteomics have been
useful in differentiating benign from malignant breast masses. The combination of mammographic
and proteomic information can lead to a more specific classifier for difficult cases. Ensemble
classifiers for breast cancer combining multiple sources of information have been shown to
outperform classifiers using only one of the information sources.

This research has two purposes. The first is to create three separate classifiers for breast cancer
based on proteomic information, mammogram information, and radiologist-interpreted BI-RADS
descriptors. The second is to combine the outputs of these three first-stage classifiers into one
ensemble classifier for breast cancer, which is expected to outperform any of the component
classifiers.

BODY 

Task 1. Build a Bayesian regression model classifier for breast cancer based on image
features of digitized mammograms. Evaluate the model performance using honest leave-one-out
cross-validation (LOOCV) with the ROC area as the performance metric. Calculate the Bayesian
posterior classification probability intervals to provide an honest assessment of the
uncertainties of the predictive classifications. (Months 1-12)

This task was completed during the current, first year and has resulted in one
accepted and one submitted peer-reviewed publication as well as one full-length conference
proceedings paper (see #1, #2, and #3 in Reportable Outcomes). On each digitized mammogram,
a 512x512 region of interest (ROI) centered on the centroid of each calcification cluster was
extracted. The automated image-processing scheme consisted of the following steps: (1)
preprocessing using unsharp masking, (2) segmentation of individual calcifications using a
back-propagation artificial neural network (BP-ANN) classifier, and (3) cluster classification using
another BP-ANN classifier to reduce the number of false positive clusters. For each cluster, the
algorithm calculated 22 image-processing features, consisting mostly of shape features for the
calcifications and calcification clusters and of texture features for ROIs centered on the clusters.

Once the features had been extracted from the mammogram, classification models used them to
distinguish benign from malignant calcification lesions. In addition to Bayesian
probit regression models, for comparison we also applied two well-established CADx classifiers,
linear discriminant analysis (LDA) and an artificial neural network (ANN). We also applied two
variants of a novel classifier, decision fusion (see #1 in Reportable Outcomes): decision fusion
to maximize the area under the ROC curve (DF-A), and decision fusion to maximize the partial
area of the high-sensitivity region (TPF > 0.90) (DF-P). Figure 1a shows the ROC curve for the
Bayesian probit regression, and Figure 1b shows the set of ROC curves for the other classifiers.
The classifiers’ performances under 100-fold cross validation were AUC = 0.73 for
Bayesian probit regression, 0.68 ± 0.01 for LDA, 0.76 ± 0.01 for ANN, 0.85 ± 0.01 for DF-A,
and 0.82 ± 0.01 for DF-P. Decision fusion significantly outperformed the other classifiers
(p < 0.001).

Figure 1a: Bayesian probit regression (GLM ROC, AUC = 0.7305)

Figure 1b: ROC curves for the calcification data set (LDA, ANN, and decision fusion)

Task 2. Build a Bayesian regression model classifier for breast cancer based on patient age
and BI-RADS features from radiologists. Evaluate the model performance and
classification uncertainties as in Aim 1. (Months 13-16)

This task has already been completed and has resulted in publications (see #1 and #2 in
Reportable Outcomes). The mammographic findings for each case in our database have been
interpreted by dedicated breast imaging radiologists using the Breast Imaging Reporting and
Data System (BI-RADS) lexicon from the American College of Radiology. The BI-RADS
lexicon provides categorical descriptions (findings) for each mammographic feature.

While  the  original  research  proposal  focused  only  on  microcalcification  lesions,  we  have 
responded  to  one  of  the  proposal  reviewers  and  have  extended  the  research  project  to  include 
masses  as  well.  Including  masses  will  lend  additional  clinical  relevance  to  this  project.  Currently, 
the  radiologist-interpreted  BI-RADS  features  are  available  only  for  mass  cases. 

All of the classifiers were able to distinguish benign from malignant lesions well. The
classifiers’ performances under 100-fold cross validation were AUC = 0.94 for Bayesian probit
regression, 0.93 ± 0.01 for LDA, 0.93 ± 0.01 for ANN, 0.94 ± 0.01 for DF-A, and 0.93 ± 0.01 for
DF-P. Decision fusion had a slight performance gain over the ANN and LDA (p = 0.02), but was
comparable to Bayesian probit regression. The ROC curves of these classifiers are shown in
Figures 2a and 2b.
Figure 2a: Bayesian probit regression (GLM ROC, AUC = 0.9439)

Figure 2b: ROC curves for the mass data set (LDA, ANN, and decision fusion)


Task  3.  Build  a  Bayesian  regression  model  classifier  for  breast  cancer  based  on 
proteomic  profiles  of  blood  serum  samples.  Evaluate  the  model  performance  and 
classification  uncertainties  as  in  Aim  1.  (Months  16-28) 

We  have  done  some  preliminary  work  on  the  proteomics  data,  and  our  research  here  is  still  a 
work  in  progress. 

Women undergoing diagnostic biopsy for breast cancer at Duke University Medical Center
between 2000 and 2004 were enrolled in this study. Before cytoreductive surgery, the women
gave consent for the study and blood was obtained. Serum, plasma, and white blood cells were
aliquoted and cryogenically stored. Two sets were constructed from these samples: (1) forty-two
women over the age of 55 with benign breast findings and (2) forty-six women over the age of 55
with invasive breast cancers greater than 1.5 cm. In addition, sera from 120 healthy women were
used for controls.

While  the  original  research  proposal  included  proteomic  data  from  mass  spectrometry 
spectra,  these  spectra  were  found  to  be  too  noisy  for  the  purposes  of  classifying  malignant  from 
benign  lesions.  We  are  now  using  the  much  more  specific  Enzyme-Linked  ImmunoSorbent 
Assay  (ELISA)  protocol  to  extract  information  about  blood  serum  proteins.  Sera  were  assayed 
for  52  different  biomarkers  using  the  Luminex  platform  and  reagents.  Because  these  biomarkers 
are  expensive  to  collect,  we  are  currently  trying  to  identify  a  subset  of  important  proteins  by 
exploring  feature-selection  techniques  on  our  proteomics  pilot  data  set.  Once  the  important 
proteins  have  been  identified,  more  cases  will  be  collected,  allowing  for  further  modeling  and 
classifying. 

Task 4. Combine the outputs of the three Bayesian regression models into one ensemble
classifier for breast cancer diagnosis prediction. Evaluate the model performance using the
ROC area as the performance metric. (Months 28-36)

Once  we  have  finalized  all  three  of  the  separate  models  described  above,  we  will  combine 
them  into  one  ensemble  classifier. 

KEY  RESEARCH  ACCOMPLISHMENTS 

•  Developed  a  decision  fusion  model  to  combine  various  information  sources 
•  Classified the mammogram and BI-RADS data sets using the following classification
models: Bayesian probit regression, linear discriminant analysis, artificial neural network,
and decision fusion

•  Established  an  internal  collaboration  as  a  data  source  for  the  proteomics  data  set,  and 
initiated  preliminary  analysis  of  that  data  set. 


CONCLUSIONS 

The  current  work  focuses  on  combining  breast  imaging  and  proteomics  information  for 
breast  cancer  diagnosis.  This  study  is  structured  in  two  stages:  (1)  build  classification  models  on 
each  of  the  individual  data  sources,  and  (2)  combine  the  models  into  one  ensemble  classifier. 

One significant research outcome was the development of a decision fusion classification
algorithm. Decision fusion has the benefit of being robust in very noisy data sets, such as the
calcification and proteomics data sets. On the more challenging calcification data set, decision
fusion outperformed the other classifiers, achieving AUC = 0.85 ± 0.01. On the BI-RADS data
set, all classifiers performed well, with decision fusion still performing the best with
AUC = 0.94 ± 0.01.

The proteomics work is still in progress, due to the relatively small number of cases
that are currently available as well as the large number of noisy features in the data set. In future
work,  we  will  identify  a  subset  of  blood  serum  proteins  that  are  useful  for  breast  cancer 
classification.  Once  these  proteins  have  been  identified,  we  will  collect  more  cases  to  increase 
the  size  of  the  proteomics  data  set.  With  a  larger  data  set,  we  can  construct  predictive  models. 
Finally,  once  these  models  for  all  three  data  sets  have  been  finalized,  we  will  combine  them  into 
one  ensemble  classifier. 


REPORTABLE  OUTCOMES 

The following publications are attached as appendices 1-4 with the same numbers. The names of
the fellow (Jesneck) and mentor (Lo) are boldfaced for emphasis.

1  Jesneck JL, Nolte LW, Baker JA, Floyd CE, Lo JY, “An optimized approach to decision
fusion of heterogeneous data for breast cancer diagnosis,” Medical Physics (in press)

2  Jesneck JL, Lo JY, Baker JA, “A computer aid for diagnosis of breast mass lesions using
both mammographic and sonographic BI-RADS descriptors,” Radiology (submitted)

3  Jesneck JL, Nolte LW, Baker JA, Lo JY, “The effect of data set size on computer-aided
diagnosis of breast cancer: Comparing decision fusion to a linear discriminant,” in SPIE
Medical Imaging 2006: Image Processing (2006)

Abstract


As more diagnostic testing options become available to physicians, it becomes more
difficult to combine various types of medical information in order to optimize the
overall diagnosis. To improve diagnostic performance, here we introduce an approach to
optimize a decision-fusion technique to combine heterogeneous information, such as
from different modalities, feature categories, or institutions. For classifier comparison we
used two performance metrics: the ROC area under the curve (AUC) and the normalized
partial area under the curve (pAUC). This study used four classifiers: linear discriminant
analysis (LDA), artificial neural network (ANN), and two variants of our decision-fusion
technique, AUC-optimized (DF-A) and pAUC-optimized (DF-P) decision fusion. We
applied each of these classifiers with 100-fold cross validation to two heterogeneous
breast cancer data sets: one of mass lesion features and a much more challenging one
of microcalcification lesion features. For the calcification data set, DF-A outperformed the
other classifiers in terms of AUC (p < 0.02) and achieved AUC = 0.85 ± 0.01. The DF-P
surpassed the other classifiers in terms of pAUC (p < 0.01) and reached pAUC = 0.38 ±
0.02. For the mass data set, DF-A outperformed both the ANN and the LDA (p < 0.04)
and achieved AUC = 0.94 ± 0.01. Although for this data set there were no statistically
significant differences among the classifiers’ pAUC values (pAUC = 0.57 ± 0.07 to 0.67 ±
0.05, p > 0.10), the DF-P did significantly improve specificity versus the LDA at both 98%
and 100% sensitivity (p < 0.04). In conclusion, decision fusion directly optimized clinically
significant performance measures such as AUC and pAUC, and sometimes
outperformed two well-known machine-learning techniques when applied to two different
breast cancer data sets.

I. Introduction

Breast cancer accounts for one-third of all cancer diagnoses among American women,
has the second highest mortality rate of all cancers in women, and is expected to
account for 15% of all cancer deaths in 2005. Early diagnosis and treatment can
significantly improve the chance of survival for breast cancer patients. Currently,
mammography is the preferred screening method for breast cancer. However, high false
positive rates reduce the effectiveness of screening mammography, as several studies
have shown that only 13-29% of suspicious masses are determined to be malignant.
Unnecessary surgical biopsies are expensive, cause patient anxiety, alter cosmetic
appearance, and can distort future mammograms.

Commercial products for computer-aided detection (CAD) have shown promise for
improving sensitivity in large clinical trials. Most studies to date have shown CAD to
boost radiologists’ lesion detection sensitivity. To date, however, there are no
commercial systems to improve specificity for breast cancer screening. To fill this need
to improve the specificity of mammography, computer-aided diagnosis (CADx) has
emerged as a promising clinical aid.

There has been considerable CAD and CADx research based upon a rich variety of
modalities and sources of medical information, such as digitized screen-film
mammograms, full-field digital mammograms, sonograms, MRI images, and gene
expression profiles. Current clinically implemented CADx
programs tend to use only one information source, although multimodality CADx
programs are beginning to emerge. Moreover, most CADx research has been
performed using relatively homogeneous data sets collected at one institution, acquired
using one type of digitizer or digital detector, or using features drawn from one source
such as human-interpreted findings versus computer-extracted features. Increasingly,
however, there is a trend towards boosting diagnostic performance by combining
data from many different sources to create heterogeneous data. We defined
heterogeneous data as comprising multiple, distinct groups. Specifically, for this study
we considered as heterogeneous any of the following data set characteristics: multiple
imaging modalities, multiple types of mammogram film digitizers, data collected from
multiple institutions, and various types of features extracted from the same image,
especially computer-extracted and human-extracted features. Combining heterogeneous
data types for classification is a difficult machine-learning problem, but one that has
shown promise in bioinformatics applications.

To meet the challenge of combining heterogeneous data types, we turned to a
decision-fusion method that operates by the following two steps: (1) classifiers use feature
subsets to generate initial binary decisions, and (2) these binary decisions are then
combined optimally using decision fusion theory. Decision fusion offers the following
advantages: it handles heterogeneous data sources well, reduces the problem
dimensionality, is easily interpretable, and is easy to use in a clinical setting. Decision
fusion has effectively combined heterogeneous data in many diverse classification tasks,
such as detecting land mines using multiple sensors, identifying persons using multiple
biometrics, and CADx of endoscopic images using multiple sets of medical features.

The purpose of this study was to optimize a decision-fusion approach for classifying
heterogeneous breast cancer data. We compared this decision-fusion approach to a
linear discriminant and an artificial neural network, which are well-studied techniques
that have frequently been applied to breast cancer CADx. This study evaluates
these classification algorithms on two breast cancer data sets using two different
clinically relevant performance metrics.

II.  Methods 
A.  Data 

For  this  study,  we  chose  two  different  breast  cancer  data  sets,  which  differed 
considerably  in  the  type  and  number  of  patient  cases  as  well  as  the  type  and  number  of 
medical  information  features  describing  those  cases. 

Microcalcification  Lesions 

Data set C consisted of all 1508 mammogram microcalcification lesions from the Digital
Database for Screening Mammography (DDSM). The outcomes were verified by
histological diagnosis and follow-up for certain benign cases, yielding 811 benign and
697 malignant calcification lesions. Figure 1 shows the feature group structure of this
data set. The feature groups were 13 computer-extracted calcification cluster
morphological features, 91 computer-extracted texture features of the lesion background
anatomy, 2 radiologist-interpreted findings, 3 radiologist-extracted features from the
Breast Imaging Reporting and Data System (BI-RADS™, American College of Radiology,
Reston, VA), and patient age. In total, data set C had 110 features and a
sample-to-feature ratio of approximately 14:1. Each mammogram was digitized with one of four
digitizers: a DBA M2100 ImageClear at a resolution of 42 microns, a Howtek 960 at 43.5
microns, a Howtek MultiRad850 at 43.5 microns, or a Lumisys 200 Laser at 50 microns.
To study this large, heterogeneous data set, no attempt was made to restrict cases only
to a single digitizer, as was common in most previous studies. Moreover, no
standardization step was applied to the images to correct for the differences in noise,
resolution, and other physical characteristics from the various digitizers. We used a
512x512 pixel ROI centered on the centroid of each lesion (using lesion outlines drawn
by the DDSM radiologists) for image processing and for generating the
computer-extracted features. We extracted morphological and texture (spatial gray level
dependence matrix) features, which were shown to be useful in a previous study of
CADx by Chan et al.

This data set was heterogeneous in several ways: it was collected at four
different institutions, scanned on four types of digitizers with different physical
characteristics, and included both human-extracted features and computer-extracted
features, such as shape and texture features.

Mass  Lesions 

Data set M consisted of 568 breast mass cases that were collected in the Radiology
Department of Duke University Health System between 1999 and 2001. These cases
were an extension of the data set described in detail in our previous studies.
Definitive histopathologic diagnosis from biopsy was used to determine outcome,
yielding 370 benign and 198 malignant mass lesions. Figure 2 shows the feature group
structure of this data set. Dedicated breast radiologists recorded all features.

The mass data set was heterogeneous because it comprised 3 distinct types of
data: 13 mammogram features, 23 sonogram features in turn drawn from 3 different
lexicons (Ultrasound BI-RADS, Stavros, and others), and 3 patient history
features. In total, data set M had 39 features and a sample-to-feature ratio of
approximately 15:1.
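
As a quick arithmetic check, the two reported sample-to-feature ratios follow directly from the case and feature counts given above for data sets C and M:

$$
\frac{1508}{13 + 91 + 2 + 3 + 1} = \frac{1508}{110} \approx 13.7 \approx 14{:}1,
\qquad
\frac{568}{13 + 23 + 3} = \frac{568}{39} \approx 14.6 \approx 15{:}1 .
$$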

B.  Decision  Fusion 

There is a growing literature in the area of distributed detection. The early classical
references include the work of Tenney and Sandell, who introduced distributed detection
using a fixed fusion processor and optimized the local processors. Chair and Varshney
fixed the local processors and optimized the fusion processor. Reibman and Nolte
extended these previous studies by simultaneously optimizing the local detectors while
deriving the overall optimum fusion design. Dasarathy summarizes some of the earlier work.

Decision fusion theory describes how to combine local binary decisions optimally to
determine the presence or absence of a signal in noise. The local binary
decisions can come from any arbitrary source.

Figure 3 provides a schematic of our decision-fusion method. Our algorithm is a
two-stage process, with a likelihood ratio calculation in each stage. The first stage applies a
separate likelihood ratio to each feature. These feature-level likelihood ratios are then compared
to separate thresholds to generate feature-level decisions. These feature-level decisions
are then fused in the second stage by computing the likelihood ratio of the binary
decision values. The second stage combines the feature-level decisions into one fused
likelihood-ratio value, which can be used as a classification decision variable.

Our technique offers the important advantage that it can reduce the dimensionality of the
feature space of the classification problem by assigning a classifier to each feature
separately. Considering only one feature at a time greatly reduces the complexity of the
problem by avoiding the need to estimate multidimensional probability density functions
(PDFs) of the feature space. Accurately estimating such multidimensional PDFs likely
requires many more observations than a typical medical data set contains. Other
benefits of decision fusion are that it is robust in noisy data, is not overly sensitive to
the likelihood ratio threshold values, and can handle missing data values. Our
decision-fusion technique can also be tuned to maximize arbitrary performance metrics
(as described later in Section II.C) that may be more clinically relevant, unlike more
traditional classification algorithms that minimize mean squared error.

1.  Detection  Theory  Approach  -  the  Likelihood  Ratio 

Although decision fusion combines binary decisions regardless of how those decisions
were made, it is still important to choose the right initial classifiers in order to pass as
much information to the decision fuser as possible. In our algorithm, we used the
likelihood ratio as the initial classifier and applied a threshold to generate the binary
decisions on each feature. Previous work has shown the likelihood ratio to be an
excellent classifier for breast cancer mass lesion data.

According to decision theory, the likelihood ratio is the optimal detector to determine the
presence or absence of a signal in noise. For this study, the signal to be detected was
the potential malignancy of a breast lesion. The null hypothesis ($H_0$) was that the signal
(malignancy) is not present in the noisy features, while the alternative hypothesis ($H_1$)
was that the signal is present.

$$
\begin{aligned}
H_0 &: X = N \\
H_1 &: X = S + N
\end{aligned}
\tag{1}
$$

Sources  of  noise  in  the  features  included  anatomical  noise  inherent  in  the  mammogram 
or  sonogram,  quantum  noise  in  the  acquisition  of  the  mammogram  or  sonogram, 
digitization  noise  and  artifacts  for  data  set  C,  and  ambiguities  in  the  mammogram 
reading  process  for  the  radiologist-interpreted  findings  in  both  data  sets  C  and  M. 

The likelihood ratio is the probability of the features under the malignant case divided by
the probability of the features under the benign case:

$$
\lambda(X) = \frac{P(X \mid H_1)}{P(X \mid H_0)}, \tag{2}
$$

where $P(X \mid H_1)$ is the PDF of the observation data $X$ given that the signal is present,
and $P(X \mid H_0)$ is the PDF of the data $X$ given that the signal is not present. The
likelihood ratio is optimal under the assumption that the PDFs accurately reflect the true
densities. We estimated the one-dimensional PDFs of the features with histograms. We
used Scott’s rule to determine the optimal histogram bin width,

$$
h = 3.5\,\sigma\,n^{-1/3}, \tag{3}
$$

where $h$ is the bin width, $\sigma$ is the standard deviation, and $n$ is the number of
observations. The interval of two standard deviations around the mean,
$[\mu - 2\sigma, \mu + 2\sigma]$, was then subdivided by the bin width, $h$. We assigned the
values falling outside this interval to the extreme left or right bins. Next, we applied a
threshold value, $\tau$, to the likelihood ratio to produce a binary decision about the
presence of the signal.

$$
u =
\begin{cases}
1, & \lambda_{feature} \ge \tau \\
0, & \lambda_{feature} < \tau
\end{cases}
\tag{4}
$$

2.  Fusing  the  Binary  Decisions 

For the signal-plus-noise hypothesis $H_1$, the probability of detecting an existing signal
is $P(u = 1 \mid H_1) = P_d$ and of missing it is $P(u = 0 \mid H_1) = 1 - P_d$. For the noise-only
hypothesis $H_0$, the probability of false detection is $P(u = 1 \mid H_0) = P_f$ and of correctly
rejecting the missing signal is $P(u = 0 \mid H_0) = 1 - P_f$. Using these probabilities, the
likelihood ratio value of a binary decision variable has a simple form, as shown in
Equation (5).

$$
\lambda_{decision}(u) = \frac{P(u \mid H_1)}{P(u \mid H_0)} =
\begin{cases}
\dfrac{P_d}{P_f}, & \text{if } u = 1 \\[2ex]
\dfrac{1 - P_d}{1 - P_f}, & \text{if } u = 0
\end{cases}
\tag{5}
$$


We can then use the likelihood ratios of the individual local decision variables to
calculate the joint likelihood ratio of the set of decision variables. Assuming that the local
decision variables are statistically independent, the likelihood ratio of the fused classifier
is a product of the likelihood ratios of the individual local decisions.

$$
\lambda_{fusion} = \prod_{i} \lambda_{decision}(u_i)
= \prod_{i} \left( \frac{P_{d_i}}{P_{f_i}} \right)^{u_i}
\left( \frac{1 - P_{d_i}}{1 - P_{f_i}} \right)^{1 - u_i}
\tag{6}
$$


Note  that  we  assume  statistical  independence  of  only  the  local  binary  decisions,  not  of 
the  sensitivity,  false-positive  rate,  or  even  the  features  on  which  the  local  decisions  were 
made. 
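
A minimal Python sketch of the fusion stage follows, assuming the local decisions are independent as stated. The helper names and the add-`eps` smoothing used when estimating each detector's detection and false-alarm probabilities from training decisions are illustrative assumptions; the smoothing only keeps the ratios finite when a detector never fires or always fires.

```python
import math


def estimate_rates(decisions, labels, eps=0.5):
    """Estimate (P_d, P_f) for one local detector from its training decisions.

    decisions: 0/1 outputs of the detector; labels: 1 = malignant, 0 = benign.
    eps is an assumed smoothing constant, not taken from the text.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    hits = sum(u for u, y in zip(decisions, labels) if y == 1)
    false_alarms = sum(u for u, y in zip(decisions, labels) if y == 0)
    p_d = (hits + eps) / (n_pos + 2.0 * eps)
    p_f = (false_alarms + eps) / (n_neg + 2.0 * eps)
    return p_d, p_f


def decision_likelihood_ratio(u, p_d, p_f):
    """Likelihood ratio of a single binary decision: P(u|H1) / P(u|H0)."""
    return p_d / p_f if u == 1 else (1.0 - p_d) / (1.0 - p_f)


def fused_likelihood_ratio(decisions, rates):
    """Product of the local decision likelihood ratios for one case.

    decisions: one 0/1 value per local detector for the case.
    rates: matching list of (P_d, P_f) pairs.
    The product is accumulated in log space for numerical stability.
    """
    log_ratio = 0.0
    for u, (p_d, p_f) in zip(decisions, rates):
        log_ratio += math.log(decision_likelihood_ratio(u, p_d, p_f))
    return math.exp(log_ratio)
```

For two detectors with (P_d, P_f) of (0.8, 0.2) and (0.6, 0.3), two positive decisions multiply to 4 × 2 = 8, while two negative decisions give 0.25 × (0.4 / 0.7), which is well below 1.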

In our decision fusion theory approach we have made the important assumption that all
the local decisions are statistically independent. While this appears to be a very strong
assumption, using it in decision fusion often does not lower classification performance
substantially below the performance of the optimal decision fusion processor for
correlated decisions. Although we can construct an optimal correlated decision fusion
processor with known decision correlations, it is difficult to estimate the correlation
structure of the decisions accurately, especially given many decisions but only few
observations. However, even with correlated decisions, the simplifying assumption of
independent decisions often costs little in decision fusion performance. Liao et al.
have shown that, under certain conditions for the case of fusing two correlated decisions,
the independent fusion processor exactly matched the performance of the optimal
correlated decision fusion processor. Even in many situations when the optimality
conditions were not met, the degradation of the fusion performance was not significant.
Another benefit of the independent local decisions assumption is that decision fusion
can usually recover from weak signals and correlated features given enough decisions to
fuse. Because we have a large number of local decisions by setting a separate local
decision for each feature, our algorithm takes advantage of this performance benefit.

C.  Classifier  Evaluation  and  Figures  of  Merit 

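We used the ROC curve to capture the classification performance of our decision-fusion
algorithm. Assuming independent local decisions, the probability density functions
(PDFs) of the decision fusion likelihood ratio have a similar product form,

$$
P(\lambda_{fusion} \mid H_1) = \prod_{i=1}^{N} P(u_i \mid H_1),
\qquad
P(\lambda_{fusion} \mid H_0) = \prod_{i=1}^{N} P(u_i \mid H_0),
\tag{7}
$$

where $N$ is the number of local decisions. Using the fusion likelihood ratio value as a
classification decision variable, the probabilities of detection and false alarm are
calculated as follows:

$$
P_{d,fusion}(\beta) = P(\lambda_{fusion} > \beta \mid H_1),
\qquad
P_{f,fusion}(\beta) = P(\lambda_{fusion} > \beta \mid H_0),
$$

where $\beta$ is a threshold on $\lambda_{fusion}$ that determines the operating point on the
ROC curve. By varying the value of the threshold $\beta$, these $P_{d,fusion}(\beta)$ and
$P_{f,fusion}(\beta)$ values trace the entire decision-fusion ROC curve.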
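
The threshold sweep can be illustrated empirically. The sketch below assumes a list of fused likelihood-ratio values (`scores`) with known outcomes (`labels`) and estimates the detection and false-alarm fractions by counting cases. The function name `roc_points` is hypothetical, and the empirical counting stands in for the probability expressions above.

```python
def roc_points(scores, labels):
    """Trace an empirical ROC curve by sweeping a threshold over the scores.

    scores: decision variable per case (e.g. the fused likelihood ratio).
    labels: 1 = malignant, 0 = benign.
    Returns (false-positive fraction, true-positive fraction) pairs,
    starting at (0, 0) and ending at (1, 1).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Each distinct score is used as a threshold, from strictest to most lenient.
    # Calling a case positive when score >= beta at an observed score is the same
    # as score > beta' for a threshold beta' placed just below that score.
    for beta in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= beta and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= beta and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points
```

Each returned pair is one operating point; lowering the threshold moves the point up and to the right along the curve.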

One  can  use  the  ROC  curve  to  quantify  classification  performance  by  calculating 
summary  metrics  of  the  curve.  Certain  performance  metrics  have  more  significance  in  a 
clinical  setting  than  others,  especially  when  high  sensitivity  must  be  maintained.  This 
study  used  two  clinically  interesting  summary  metrics  of  the  ROC  curve:  the  area  under 
the  curve  (AUC),  and  the  normalized  partial  area  under  the  curve  (pAUC)  above  a 
certain sensitivity value. For this study, we set the sensitivity value TPF = 0.90 for
pAUC  to  reflect  that  diagnosing  breast  cancer  at  high  sensitivities  is  clinically  imperative. 
We  used  the  non-parametric  bootstrap  method  to  measure  the  means  and  variances 
of  the  AUC  and  pAUC  values  as  well  as  to  compare  metrics  from  two  models  for 
statistical  significance. 
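
As an illustration of these figures of merit, the sketch below computes the AUC, a normalized pAUC, and a bootstrap estimate of the AUC mean and variance from classifier scores. It assumes an empirical (piecewise-linear) ROC curve and assumes that the normalization divides the area lying above the line TPF = 0.90 and below the curve by (1 - 0.90), so that a perfect classifier scores 1; the text above does not spell out the normalization, and all function names are hypothetical.

```python
import random


def roc_points(scores, labels):
    """(FPF, TPF) pairs from (0, 0) to (1, 1); a case is positive when score >= threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for beta in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= beta and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= beta and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Trapezoidal area under the full ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def normalized_pauc(points, tpf0=0.90):
    """Area between the ROC curve and the line TPF = tpf0, divided by (1 - tpf0)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y1 <= tpf0:
            continue  # segment lies entirely below the sensitivity floor
        if y0 < tpf0:
            # Segment crosses the floor: keep only the part above it.
            x0 = x0 + (tpf0 - y0) / (y1 - y0) * (x1 - x0)
            y0 = tpf0
        area += (x1 - x0) * ((y0 - tpf0) + (y1 - tpf0)) / 2.0
    return area / (1.0 - tpf0)


def bootstrap_auc(scores, labels, n_boot=200, seed=0):
    """Mean and variance of the AUC over case-level bootstrap resamples."""
    rng = random.Random(seed)
    n = len(scores)
    values = []
    while len(values) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # a resample needs both classes to define an ROC curve
            values.append(auc(roc_points([scores[i] for i in idx], ys)))
    mean = sum(values) / n_boot
    var = sum((v - mean) ** 2 for v in values) / (n_boot - 1)
    return mean, var
```

Under this normalization the chance diagonal gives a normalized pAUC of 0.05, which helps put small pAUC values in context. Comparing two models would resample the same cases for both and examine the distribution of the difference in the metric.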

D. Genetic Algorithm Search for the Optimal Threshold Set

The selection of the likelihood-ratio threshold values is important to maximize
performance  of  the  fused  classifier.  Threshold  values  very  far  from  the  best  values  often 
lowered  the  fused  classifier’s  performance  to  near  chance  levels.  A  genetic  algorithm 
searched  over  the  likelihood-ratio  threshold  values  for  each  feature  to  select  a  threshold 
set  that  maximized  the  desired  performance  metric  or  figure  of  merit  (FOM), 

$$\tau_{\mathrm{optimal}} = \arg\max_{\tau} \, \mathrm{FOM}(u; \tau), \qquad (8)$$

where the FOM is either AUC or pAUC, $u$ is the set of local decisions, and $\tau$ is the set
of feature-level likelihood-ratio thresholds. The fitness function of the genetic algorithm
was set to the FOM in order to maximize the FOM value. We tuned the following genetic
algorithm parameters for cross-validation performance: the number of
generations, the population size, and the rates of selection, crossover, and mutation.
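
The search can be illustrated with a small, generic genetic algorithm. This is a sketch under stated assumptions, not the toolbox implementation used in the study: tournament selection, uniform crossover, Gaussian mutation, and single-individual elitism are illustrative choices, and `toy_fitness` is a hypothetical stand-in for the FOM (AUC or pAUC of the fused classifier evaluated at a candidate threshold set).

```python
import random


def genetic_search(fitness, n_genes, low, high, pop_size=30, generations=40,
                   crossover_rate=0.8, mutation_rate=0.1, seed=0):
    """Maximize fitness over vectors of n_genes thresholds in [low, high]."""
    rng = random.Random(seed)
    pop = [[rng.uniform(low, high) for _ in range(n_genes)] for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        best = max(pop, key=fitness)
        history.append(fitness(best))
        next_pop = [best[:]]  # elitism: the best individual always survives
        while len(next_pop) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            if rng.random() < crossover_rate:
                child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            else:
                child = p1[:]
            # Gaussian mutation, clipped to the allowed threshold range.
            child = [min(high, max(low, g + rng.gauss(0.0, 0.1 * (high - low))))
                     if rng.random() < mutation_rate else g
                     for g in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history


def toy_fitness(thresholds):
    """Hypothetical stand-in for the FOM: peaks at thresholds (0.3, 0.7, 0.5)."""
    target = (0.3, 0.7, 0.5)
    return -sum((t - g) ** 2 for t, g in zip(thresholds, target))
```

Because of the elitism step, the best fitness never decreases from one generation to the next, so the returned `history` can be used to check convergence.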

E.  Decision  Fusion  with  Cross  Validation 

We  used  k-fold  cross  validation  (k=100)  to  estimate  the  ability  of  the  classifiers  to 
generalize  on  our  data  sets.  For  each  fold,  a  new  model  was  developed,  i.e.,  the 
likelihood  ratio  was  formed  on  the  k-1  subsets  (99%  of  cases)  used  as  training  samples, 
and  the  genetic  algorithm  searched  over  the  thresholds  to  maximize  the  performance 
metric  on  these  training  samples.  Once  the  best  thresholds  had  been  found  on  the 
training  set,  they  were  then  used  to  evaluate  the  algorithm  on  the  one  subset  (1%  of 
cases) withheld for validation. The resulting local decisions were then combined into the
fused validation likelihood ratio as in Equation (6). The process was then repeated k times,
withholding a different subset for validation each time, so that all cases were used for both
training and validation while independence between those subsets was maintained.

Compiling the fused validation likelihood-ratio values from all folds at the end of the
cross-validation computations created a distribution of fused likelihood-ratio values over the
test cases. We constructed an ROC curve from these test-case values, as in Equation (8),
in order to measure the classification
performance of the decision-fusion classifier with k-fold cross validation.
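
The cross-validation loop can be sketched as follows. This is a minimal illustration with hypothetical names: `fit` stands for everything done on the training folds (forming the likelihood ratios and running the threshold search), and `predict` stands for computing the fused likelihood ratio of one withheld case.

```python
import random


def k_fold_indices(n_cases, k, seed=0):
    """Shuffle the case indices and deal them into k disjoint folds."""
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_scores(features, labels, k, fit, predict, seed=0):
    """Return one held-out score per case; no case is scored by a model that saw it."""
    scores = [None] * len(labels)
    for fold in k_fold_indices(len(labels), k, seed):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(features) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        model = fit(train_x, train_y)      # trained on the other k-1 folds only
        for i in fold:
            scores[i] = predict(model, features[i])
    return scores
```

The compiled `scores` list plays the role of the distribution of test-case values from which the validation ROC curve is built.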

F.  Using  Decision  Fusion  in  a  Diagnostic  Setting 

Once  the  model  has  been  fully  trained  and  validated,  it  can  similarly  be  applied  to  new 
cases  by  setting  all  of  the  existing  data  to  be  the  training  data  and  applying  the  new 
clinical case as a new validation case. The decision-fusion algorithm would then recommend
to the physician either biopsy, for a lesion classified as malignant, or short-term follow-up,
for a lesion classified as very likely benign.

G.  Other  Classifiers:  Artificial  Neural  Network  and  Linear  Discriminant 

We  compared  the  classification  performance  of  the  decision  fusion  against  both  an 
artificial  neural  network  (ANN)  and  Fisher’s  linear  discriminant  analysis  (LDA),  which  are 
well-understood  algorithms  and  are  popular  breast  cancer  CADx  research  tools. 

For  the  ANN,  we  used  a  fully-connected,  feed-forward,  error  backpropagation  network 
with  a  hidden  layer  of  5  nodes,  implemented  using  the  nnet  package  (version  7.2-20)  for 
R  statistical  software  (version  1.12,  the  R  Project  for  Statistical  Computing).  For  the  LDA, 
we  used  the  Statistics  Toolbox  (version  5.1)  of  MATLAB®  (Release  14,  Service  Pack  2, 
Mathworks  Inc,  Natick  MA).  Both  models  were  carefully  verified  against  custom  software 
previously developed within our group. We implemented our decision-fusion algorithm in
MATLAB, relying specifically on the Genetic Algorithm and Direct Search Toolbox
(version  2)  to  find  the  best  thresholds  for  the  likelihood  ratio  values. 

III.  Results 

A.  Classifier  Performance  on  Data  Set  C  (Calcification  Lesions) 

Figure  4  shows  the  validation  ROC  curves  for  the  calcification  data.  Table  1  lists  the 
classification  performances  of  the  four  classifiers,  while  Tables  2  and  3  list  the  two-tailed 
p-values  for  the  pairwise  comparisons  by  AUC  and  pAUC,  respectively.  The  DF-A 
showed the best overall performance, with AUC = 0.85 ± 0.01, and the DF-P was slightly
worse with AUC = 0.82 ± 0.01. Both decision-fusion ROC curves were well above those
of  the  LDA  and  ANN,  both  in  terms  of  AUC  (p  <  0.0001)  and  pAUC  (p  <  0.02).  None  of 
the  features  were  particularly  strong  by  themselves;  we  ran  an  LDA  on  each  feature 
separately,  yielding  on  average  AUC  =  0.53  ±  0.03,  with  a  maximum  of  AUC  =  0.66  for 
the  best  feature. 

The  DF-P  curve  (pAUC  =  0.38  ±  0.02)  crossed  the  DF-A  curve  (pAUC  =  0.28  ±  0.03)  at 
the  line  TPF  =  0.9.  In  order  to  gain  high-sensitivity  performance,  DF-P  sacrificed 
performance  in  the  less  clinically  relevant  range  of  TPF  <  0.9.  The  DF-A  beat  the  DF-P  in 
terms of AUC (p = 0.018) but lost in pAUC (p < 0.01). Both decision-fusion classifiers
greatly outperformed both the ANN (pAUC = 0.14 ± 0.02) and the LDA (pAUC = 0.09 ±
0.06) in terms of pAUC.

B. Classifier Performance on Data Set M (Mass Lesions)

Figure 5 shows the validation ROC curves of the classifiers for the mass data set. Table
4  lists  the  classification  performances  of  the  four  classifiers,  whereas  Tables  5  and  6  list 
the  p-values  for  the  pairwise  comparisons  by  AUC  and  pAUC,  respectively.  For  this  data 
set,  all  the  classifiers  had  higher  but  very  similar  performance,  with  AUC  ranging  from 
0.93 ± 0.01 (LDA) to 0.94 ± 0.01 (DF-A). The DF-A nonetheless significantly
outperformed both the LDA (p = 0.021) and the ANN (p = 0.038) in terms of AUC, although
it did not significantly outperform the DF-P (p = 0.50). The LDA, ANN, and DF-P curves were all very similar, for both
AUC (p > 0.10) and pAUC (p > 0.10). Figure 5 (b) shows the ROC curves in the high
sensitivity  region  above  the  line  TPF  =  0.90.  The  classifiers’  pAUC  values  ranged 
narrowly  from  0.57  ±  0.07  (ANN)  to  0.67  ±  0.05  (DF-P),  all  close  enough  to  show  no 
statistically  significant  differences  (p  >  0.10).  However,  the  DF-P  did  have  a  higher 
specificity  than  the  LDA  at  both  98%  sensitivity  (0.37  ±  0.10  vs.  0.13  ±  0.13,  p  =  0.04) 
and at 100% sensitivity (0.34 ± 0.08 vs. 0.09 ± 0.12, p = 0.03). The DF-P curve passed
the  DF-A  curve  approximately  at  the  line  TPF  =  0.90  and  yielded  a  slightly  higher  pAUC 
(0.67  ±  0.05  vs.  0.63  ±  0.07),  although  this  improvement  was  not  statistically  significant 
(p  =  0.48). 
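
The specificity at a fixed sensitivity, as reported above for 98% and 100% sensitivity, can be read off the scores directly. The sketch below is one simple way to do it, assuming an empirical operating point (no interpolation between ROC points) and a hypothetical function name; it picks the highest threshold that still reaches the target sensitivity and reports the fraction of benign cases falling below it.

```python
import math


def specificity_at_sensitivity(scores, labels, target_tpf):
    """Specificity at the highest threshold whose sensitivity is >= target_tpf."""
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Number of malignant cases that must be called positive (at least one).
    n_needed = max(1, math.ceil(target_tpf * len(pos) - 1e-12))
    threshold = pos[n_needed - 1]  # a case is called positive when score >= threshold
    true_negatives = sum(1 for s in neg if s < threshold)
    return true_negatives / len(neg)
```

With few malignant cases, the operating points near 100% sensitivity depend on one or two cases, which is consistent with the wide uncertainty intervals reported at those sensitivities.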


IV.  Discussion 

The multitude of medical data becoming available to physicians presents the problem of
how best to integrate the information to improve diagnostic performance. Despite the recent
availability  of  this  information,  current  CADx  programs  for  breast  cancer  tend  to  use  only 
one  type  of  data,  usually  digitized  mammogram  films.  Because  many  clinical  tests 
provide  complementary  information  about  a  disease  state,  it  is  important  to  develop  a 
CADx system that incorporates data from disparate sources. However, combining
disparate data types together for classification is a difficult machine-learning problem.
This  study  used  the  likelihood-ratio  detector  and  decision-fusion  classifier  to  detect  the 
presence  of  a  malignancy  (a  signal)  within  medical  data  (noisy  features).  We  also 
compared  the  performance  of  this  classifier  to  two  popular  classifiers  in  the  CADx 
literature,  LDA  and  ANN,  and  we  measured  the  diagnostic  performance  with  two 
classification  metrics,  ROC  AUC  and  pAUC.  Finally,  we  performed  these  studies  using 
two  very  different  data  sets  in  order  to  assess  performance  differences  due  to  the  data 
set  itself. 

Data  set  C  (calcification  lesions)  had  a  stronger  nonlinear  component,  indicated  by  the 
fact  that  the  ANN  AUC  was  much  greater  than  the  LDA  AUC.  The  robustness  of  the 
decision-fusion  algorithm  is  evident  in  its  good  performance  on  this  weaker,  nonlinear, 
and  noisy  data  set.  Decision  fusion  significantly  outperformed  the  ANN  and  LDA  on  the 
calcification  data  set  for  both  performance  metrics.  Figure  4  and  Table  1  show  that  the 
biggest  performance  gain  is  in  the  pAUC  metric,  for  which  decision  fusion  doubled  the 
performance  of  the  other  classifiers. 

On  data  set  M  (mass  lesions),  all  four  classifiers  seemed  to  be  saturated  at  a  high  level 
of  performance  in  terms  of  both  AUC  and  pAUC,  as  shown  in  Figure  5  and  Table  4. 
Performances  were  largely  equivalent  across  all  models,  except  for  two  trends.  In  terms 
of  AUC,  the  DF-A  outperformed  both  the  ANN  and  the  LDA  (p  =  0.038  and  0.021, 
respectively).  Although  on  this  data  set  decision  fusion  offered  only  relatively  modest 
gains  in  pAUC,  it  did  achieve  a  significantly  better  specificity  than  the  LDA  at  several  of 
the highest sensitivities of the ROC curve (p < 0.05).

This decision-fusion algorithm has many potential benefits over more traditional
classification  algorithms.  Decision  fusion  can  be  optimized  for  any  desired  performance 
metric  by  incorporating  the  metric  into  the  fitness  function  of  the  genetic  algorithm  for  its 
search  over  the  likelihood-ratio  thresholds.  This  advantage  has  important  clinical 
implications,  as  both  the  physician  and  the  CADx  algorithm  are  constrained  to  operate  at 
high  sensitivity.  The  performance  metric  can  emphasize  good  performance  at  high 
sensitivities  and  deemphasize  performance  at  clinically  unacceptable  low  sensitivities. 
Therefore  we  expect  the  DF-A  curve  to  maximize  AUC  and  the  DF-P  curve  to  maximize 
pAUC.  The  DF-P  curve  should  fall  under  the  DF-A  curve  for  low  FPF  values  but  should 
cross the DF-A curve at the line TPF = 0.90 to capture a greater pAUC value. Figures 4
and 5 show evidence that the DF-P did optimize pAUC. The DF-P ROC curves crossed
the DF-A curves at the line TPF = 0.90 and did in fact have a larger pAUC value than the
DF-A curves. Another advantage is that decision fusion is robust and can recover from
noisy,  weak  features.  The  likelihood-ratio  classifier  passes  information  about  the  strength 
or  weakness  of  a  feature  to  the  decision  fuser,  which  adjusts  the  influence  given  to  that 
feature.  This  feature-strength  information  is  the  ROC  operating  point  (sensitivity  and 
specificity)  determined  by  the  likelihood-ratio  threshold  that  was  found  by  the  genetic 
algorithm  search.  Figure  3  shows  a  schematic  of  this  information  flow  from  the  individual 
features  to  the  decision  fuser.  The  robustness  of  the  algorithm  also  suggests  that 
decision  fusion  may  be  able  to  reach  the  asymptotic  validation  performance  with  fewer 
data.  This  is  important  for  most  medical  researchers  who  are  starting  to  collect  new 
databases and for any databases that are expensive to collect. Because our decision-fusion
technique needs to estimate only one-dimensional PDFs, which require far
fewer data points than multidimensional PDFs, decision fusion needs many fewer data
points for training. For this reason, the decision-fusion algorithm may be able to handle
typical clinical data sets with missing data, as shown in previous work with decision
fusion.
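
To make the one-dimensional estimation concrete, the sketch below builds a likelihood ratio for a single feature from two class-conditional histograms and thresholds it into a binary local decision. The histogram estimator, the add-one smoothing, and the function names are assumptions made for illustration; the PDF estimator actually used in the study may differ.

```python
def histogram_likelihood_ratio(values, labels, n_bins=5):
    """Return a function v -> P(v | malignant) / P(v | benign) for one feature."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_of(v):
        return min(n_bins - 1, max(0, int((v - lo) / width)))

    pos = [1.0] * n_bins  # add-one smoothing keeps every ratio finite
    neg = [1.0] * n_bins
    for v, y in zip(values, labels):
        (pos if y == 1 else neg)[bin_of(v)] += 1.0
    n_pos, n_neg = sum(pos), sum(neg)

    def ratio(v):
        b = bin_of(v)
        return (pos[b] / n_pos) / (neg[b] / n_neg)

    return ratio


def local_decision(ratio, value, threshold):
    """Binary local decision: 1 when the feature's likelihood ratio exceeds its threshold."""
    return 1 if ratio(value) > threshold else 0
```

Each feature needs only its own pair of one-dimensional histograms, so a case with a missing feature value can simply contribute no local decision for that feature.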

Drawbacks  of  the  decision-fusion  algorithm  include  losing  potentially  useful  feature 
information  by  reducing  the  likelihood-ratio  values  of  the  features  to  a  binary  value. 
Although  the  algorithm  loses  some  feature  information  in  this  step,  it  recovers  by 
optimally  fusing  the  remaining  binary  feature  information  from  that  point  forward.  In  the 
ideal  case,  if  the  true  underlying  multivariate  distribution  of  the  data  happens  to  be 
known  or  can  be  estimated  with  a  high  degree  of  confidence,  then  the  Bayes  classifier 
can  take  this  information  into  account  and  is  theoretically  optimal.  However,  since  the 
true  underlying  distribution  is  almost  never  known  in  practice,  decision  fusion  is  a  good 
alternative  method,  especially  for  small  and  noisy  data  sets. 

V.  Conclusions 

We  have  developed  a  decision-fusion  classification  technique  that  combines  features 
from  heterogeneous  data  sources.  We  have  demonstrated  the  technique  on  both  a  data 
set of two different breast imaging modalities and a data set of human-extracted versus
computer-extracted  findings.  With  our  data,  decision  fusion  always  performed  as  well  as 
or  better  than  the  classic  classification  techniques  LDA  and  ANN.  The  improvements 
were  all  significant  for  the  more  challenging  data  set  C,  but  not  always  significant  for  the 
less  challenging  data  set  M.  Such  a  statement  may  not  reflect  the  full  diversity  of  these 
data  sets,  which  differ  in  many  respects,  including  linear  separability,  numbers  of  cases 
and  features,  and  feature  correlations.  Future  work  will  explore  the  contribution  of  such 
factors in order to understand the full potential and limitations of the decision-fusion
technique. In conclusion, the decision-fusion technique showed particular strength in the
task of combining groups of weak, noisy features for classification.

Table Legends


Table  1.  Classifier  Performance  on  Data  Set  C  (Calcification  Lesions) 

The  table  shows  the  AUC  and  pAUC  values  for  the  ROC  curves  of  the  four  classifiers 
under  100-fold  cross  validation.  The  performance  values  exhibited  a  wide  range.  The 
DF-A  scored  the  best  for  AUC,  while  DF-P  scored  highest  for  pAUC,  as  expected.  The 
decision  fusion  curves  soundly  outperformed  both  the  ANN  and  LDA  in  terms  of  pAUC. 

Table  2.  P-values  for  AUC  Comparisons  for  Data  Set  C  (Calcification  Lesions) 

The matrix shows the p-values for the pairwise comparisons of the classifiers’
AUC  values.  All  pairwise  comparisons  were  statistically  significant. 

Table  3.  P-values  for  pAUC  Comparisons  for  Data  Set  C  (Calcification  Lesions) 

The matrix shows the p-values for the pairwise comparisons of the classifiers’
pAUC  values.  All  pairwise  comparisons  were  statistically  significant. 

Table  4.  Classifier  Performance  on  Data  Set  M  (Mass  Lesions) 

The  table  shows  the  AUC  and  pAUC  values  for  the  ROC  curves  of  the  four  classifiers 
under  100-fold  cross  validation.  All  four  classifiers  performed  very  similarly  on  this  data 
set.  The  DF-A  scored  the  best  for  AUC,  whereas  the  DF-P  scored  highest  for  pAUC, 
although  both  were  still  within  one  standard  deviation  of  each  of  the  other  classifiers’ 
performances. 

Table 5. P-values for AUC Comparisons for Data Set M (Mass Lesions)

The matrix shows the p-values for the pairwise comparisons of the classifiers’
AUC values. The DF-A outperformed the ANN and LDA. Among the DF-P, ANN, and
LDA, there were no statistically significant AUC differences.

Table  6.  P-values  for  pAUC  Comparisons  for  Data  Set  M  (Mass  Lesions) 

The matrix shows the p-values for the pairwise comparisons of the classifiers’
pAUC  values.  None  of  the  pAUC  comparisons  were  statistically  significant.  Although 
pAUC  scores  were  similar,  the  DF-P  did  have  a  higher  specificity  than  the  LDA  at  both 
98%  sensitivity  (0.37  ±  0.10  vs.  0.13  ±  0.13,  p  =  0.04)  and  at  100%  sensitivity  (0.34  ± 
0.08  vs.  0.09  ±  0.12,  p  =  0.03). 

Tables

Table 1. Classifier Performance on Calcification Data Set C

Classifier   AUC            pAUC
DF-A         0.85 ± 0.01    0.28 ± 0.03
DF-P         0.82 ± 0.01    0.38 ± 0.02
ANN          0.76 ± 0.01    0.14 ± 0.02
LDA          0.68 ± 0.01    0.09 ± 0.06

Table 2. P-values for AUC Comparisons for Calcification Data Set C

        DF-A    DF-P      ANN        LDA
DF-A    -       0.018     <0.0001    <0.0001
DF-P            -         0.0001     <0.0001
ANN                       -          <0.0001
LDA                                  -

Table 3. P-values for pAUC Comparisons for Calcification Data Set C

        DF-A    DF-P      ANN        LDA
DF-A    -       0.0084    0.018      <0.0001
DF-P            -         0.0001     <0.0001
ANN                       -          0.016
LDA                                  -

Table 4. Classifier Performance on Mass Data Set M

Classifier   AUC            pAUC
DF-A         0.94 ± 0.01    0.63 ± 0.07
DF-P         0.93 ± 0.01    0.67 ± 0.05
ANN          0.93 ± 0.01    0.57 ± 0.07
LDA          0.93 ± 0.01    0.59 ± 0.06

Table 5. P-values for AUC Comparisons for Mass Data Set M

        DF-A    DF-P      ANN        LDA
DF-A    -       0.50      0.038      0.021
DF-P            -         0.20       0.17
ANN                       -          0.53
LDA                                  -

Table 6. P-values for pAUC Comparisons for Mass Data Set M

        DF-A    DF-P      ANN        LDA
DF-A    -       0.48      0.45       0.27
DF-P            -         0.14       0.12
ANN                       -          0.46
LDA                                  -

Figure Legends


Figure  1.  Feature  Group  Structure  for  Calcification  Data  Set  C  (Calcification  Lesions) 

The  features  of  the  calcification  data  set  consisted  of  three  main  groups:  computer- 
extracted  features,  radiologist-extracted  features,  and  patient  history  features.  The 
computer-extracted  features  were  morphological  and  shape  features  of  the  automatically 
detected  and  segmented  microcalcification  clusters  within  the  digitized  mammogram 
images.  The  radiologist-extracted  features  comprised  both  radiologist-interpreted 
findings  and  BI-RADS  features.  This  data  set  consisted  of  512x512  pixel  ROIs  of  all 
1508  calcification  lesions  in  the  Digital  Database  for  Screening  Mammography  (DDSM). 
This data set was heterogeneous in several respects: it was collected at four
different institutions, scanned on four digitizers with different noise characteristics, and
included both human-extracted and computer-extracted features, such as shape and
texture features.

Figure  2.  Feature  Group  Structure  for  Mass  Data  Set  M  (Mass  Lesions) 

The  features  of  the  mass  data  set  consisted  of  mammogram  features,  sonogram 
features, and patient history features. The mammogram features comprised both BI-RADS
features and radiologist-interpreted findings. The sonogram features consisted of
ultrasound BI-RADS features, Stavros features, and other ultrasound mass descriptors.
All image features were radiologist-extracted features. The mass data set was
heterogeneous in including both mammogram and sonogram views of the breast, as well as
patient history features.

Figure 3. The Role of Likelihood-ratio Thresholds for Decision Fusion

The  first  column  shows  plots  of  the  log-likelihood-ratio  vs.  feature  value  for  each  feature. 
The  algorithm  calculated  the  likelihood  ratio  and  then  thresholded  it  separately  for  each 
feature.  The  threshold  determined  the  ROC  operating  point  of  the  likelihood-ratio 
classifier  of  a  particular  feature.  Next,  the  algorithm  combined  the  binary  decisions  from 
the  feature-level  likelihood  ratio  classifiers  using  decision  fusion  theory  to  produce  the 
likelihood  ratio  of  the  fused  classifier. 

Figure  4.  ROC  Curves  for  Data  Set  C  (Calcification  Lesions) 

The classifiers’ ROC curves for 100-fold cross validation are shown. Figure 4 (a) shows
the full ROC curves, while Figure 4 (b) shows only the high-sensitivity region (TPF ≥
0.90). For the calcification data set, the four classifiers yielded differing classification
performance  under  100-fold  cross  validation.  Both  decision-fusion  curves  lay  significantly 
above  the  LDA  and  ANN  curves,  both  in  terms  of  AUC  and  pAUC.  As  expected,  the 
decision-fusion  classifiers  achieved  the  highest  scores  of  all  the  classifiers  for  their  target 
performance  metrics;  DF-A  attained  the  greatest  AUC,  whereas  DF-P  attained  the 
greatest  pAUC.  The  DF-P  curve  surpassed  the  DF-A  curve  and  dominated  the  other 
curves  above  the  line  TPF  =  0.90.  In  order  to  gain  high-sensitivity  performance,  DF-P 
sacrificed  performance  in  the  less  clinically  relevant  range  of  TPF  <  0.90. 

Figure  5.  ROC  Curves  for  Data  Set  M  (Mass  Lesions) 

For  the  mass  data  set,  all  classifiers  had  high  levels  of  classification  performance.  The 
DF-A  and  DF-P  achieved  the  highest  AUC  and  pAUC,  respectively.  In  terms  of  AUC,  the 
DF-A outperformed both the ANN and LDA (p = 0.038 and 0.021, respectively). In Figure
5 (b), the DF-P curve had slightly more partial area than the other curves. Despite having
statistically equivalent partial areas, the DF-P had a greater specificity than the LDA at
the high sensitivities TPF = 0.98 (p = 0.04) and TPF = 1.00 (p = 0.03).

Figures

Figure 1. Feature Group Structure for Calcification Data Set C (Calcification Lesions)
[Diagram not recoverable from the source. Surviving labels: a "Patient history features (1)" group containing patient age.]

Figure 2. Feature Group Structure for Mass Data Set M (Mass Lesions)
[Diagram not recoverable from the source.]

Figure 3. The Role of Likelihood-ratio Thresholds for Decision Fusion
[Diagram not recoverable from the source.]

Figure 4. ROC Curves for Data Set C (Calcification Lesions)
[Plots not recoverable from the source. Panel (a): ROC curves, titled "ROC Curves for the Calcification Data Set". Panel (b): partial ROC curves (TPF > 0.90), titled "Partial ROC Curves for the Calcification Data Set". Surviving axis label: TPF.]

Figure 5. ROC Curves for Data Set M (Mass Lesions)
[Plots not recoverable from the source. Panel (a): ROC curves, titled "ROC Curves for the Mass Data Set". Panel (b): partial ROC curves (TPF > 0.90), titled "Partial ROC Curves for the Mass Data Set".]

Abbreviations

ANN    Artificial Neural Network
AUC    Area Under the ROC curve
CAD    Computer-aided Detection
CADx   Computer-aided Diagnosis
DDSM   Digital Database for Screening Mammography
DF-A   AUC-optimized Decision Fusion
DF-P   pAUC-optimized Decision Fusion
FPF    False Positive Fraction
LDA    Linear Discriminant Analysis
pAUC   Partial Area Under the ROC curve (TPF ≥ 0.90)
Pd     Probability of Detection
Pf     Probability of False Alarm
ROC    Receiver Operating Characteristic
SGLD   Spatial Gray Level Dependence
TPF    True Positive Fraction

Abstract


Purpose:  To  develop  computer-aided  diagnosis  (CADx)  models  using  both  mammographic  and 
sonographic  descriptors  and  to  estimate  the  generalization  performance  of  these  models  on 
future  cases. 

Materials and Methods: Institutional Review Board approval was obtained for this HIPAA-compliant
study. Mammographic and sonographic exams were performed on 737 patients,
yielding  803  breast  mass  lesions  (296  malignant,  507  benign).  Radiologist-interpreted  features 
from  the  mammograms  and  sonograms  were  used  as  input  features  by  a  linear  discriminant 
analysis  (LDA)  and  an  artificial  neural  network  (ANN)  to  differentiate  benign  from  malignant 
lesions.  An  LDA  using  all  the  features  was  compared  to  an  LDA  using  only  stepwise-selected 
features.  Classification  performances  were  quantified  using  receiver  operating  characteristic 
(ROC)  analysis  and  were  evaluated  in  a  train,  validate,  and  retest  scheme.  On  the  retest  set,  both 
LDAs  were  compared  to  the  radiologists’  overall  assessment  score  of  malignancy. 

Results: Both the LDA and ANN achieved high classification performance with cross-validation
(AUC = 0.92 ± 0.01 and 0.90AUC = 0.54 ± 0.08 for the LDA, AUC = 0.92 ± 0.01 and 0.90AUC = 0.55
± 0.08 for the ANN). Both models also generalized very well to the retest set, with no statistically
significant performance differences between the validate and retest sets (p > 0.1). On the retest
set,  there  were  also  no  statistically  significant  performance  differences  between  the  LDA  using  all 
features  and  using  only  the  stepwise  selected  features  (p  >  0.3)  and  between  either  LDA  and  the 
radiologists’  assessment  score  (p  >  0.2). 

Conclusion:  The  results  showed  that  combining  mammographic  and  sonographic  descriptors  in 
a  CADx  model  can  result  in  high  classification  and  generalization  performance.  On  the  retest  set, 
the  LDA  matched  the  radiologists’  classification  performance. 


Introduction 


Although  mammography  is  the  only  modality  proven  to  reduce  the  mortality  due  to  breast  cancer, 
it  has  a  low  specificity  for  benign  lesions.  Because  of  mammography’s  low  specificity,  many 
women  undergo  unnecessary  breast  biopsies.  As  many  as  65-85%  of  breast  biopsies  are 
performed  on  benign  lesions  (1-3).  Not  only  does  unnecessary  biopsy  increase  the  cost  of 
mammographic  screening  (4),  but  it  also  subjects  patients  to  avoidable  emotional  and  physical 
burdens. 

To  improve  the  accuracy  of  mammography,  researchers  have  used  computer  aids  to  help 
radiologists  detect  (5-7)  and  diagnose  (8-11)  suspicious  breast  lesions.  Some  studies  have 
shown  that  such  computer-aided  diagnosis  (CADx)  systems  have  increased  the  overall  diagnostic 
sensitivity  and  specificity.  Lesions  determined  to  be  very  likely  benign  may  be  recommended  for 
short-term  follow-up  rather  than  biopsy  (12,  13). 

CADx  models  often  use  breast  morphology  descriptors  of  the  Breast  Imaging  Reporting  and  Data 
System  (BI-RADS)  lexicon.  BI-RADS  was  developed  by  the  American  College  of  Radiology 
(ACR)  to  standardize  the  interpretation  of  mammograms  (14-17).  Originally  BI-RADS  was  applied 
to  only  mammography,  but  the  crucial  adjunct  role  of  sonography  has  recently  led  the  ACR  to 
develop  a  BI-RADS  lexicon  for  breast  sonography  as  well.  Sonographic  BI-RADS  is  a  useful  tool 
to  help  standardize  the  characterization  of  sonographic  lesions  (17,  18)  and  facilitate  clinician 
communication. 

Currently,  the  primary  clinical  role  for  sonography  is  to  aid  in  distinguishing  simple  cysts  from 
solid  masses,  as  well  as  to  direct  aspirations,  wire  localizations,  and  ultrasound  guided  biopsies. 
More  recently,  several  authors  have  investigated  the  role  of  sonography  in  helping  to  differentiate 
malignant from benign breast lesions (19-23). There have also been many computer-aided
diagnosis studies in breast sonography, which are based upon image features automatically
extracted by computer vision algorithms (24-32). To the best of our knowledge, there has not yet
been a study that uses the standardized BI-RADS sonographic findings as the basis of a predictive
model, nor one that combines BI-RADS mammographic and sonographic findings for that
purpose.

A  previous  study  (33)  assessed  the  positive  predictive  value  (PPV)  and  negative  predictive  value 
(NPV)  of  the  individual  sonographic  BI-RADS  features.  This  study  extends  previous  work  by  using 
a  larger  database  of  mass  lesions  and  by  developing  and  evaluating  decision  models  based  upon 
the  BI-RADS  features,  both  mammographic  and  sonographic. 
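
For readers less familiar with these two measures, the short sketch below shows how the PPV and NPV of a single binary finding are computed from a 2x2 table of counts. The function name is hypothetical and the counts in the usage note are invented purely for illustration; they are not taken from this study or from reference (33).

```python
def predictive_values(tp, fp, tn, fn):
    """Return (PPV, NPV) for one binary finding.

    tp: malignant lesions with the finding     fp: benign lesions with the finding
    tn: benign lesions without the finding     fn: malignant lesions without the finding
    """
    ppv = tp / (tp + fp)  # fraction of lesions with the finding that are malignant
    npv = tn / (tn + fn)  # fraction of lesions without the finding that are benign
    return ppv, npv
```

For example, a finding present in 30 malignant and 70 benign lesions and absent in 5 malignant and 95 benign lesions would have a PPV of 0.30 and an NPV of 0.95.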

Materials  and  Methods 

Patient  Population 

The  cases  for  analysis  in  this  study  were  an  extension  of  the  data  set  described  in  detail  in  a 
previous  study  (33).  The  cases  were  collected  between  2000  and  2005  at  our  institution.  The  data 
set  included  803  lesions,  of  which  296  were  malignant  and  507  were  benign,  and  389  were 
palpable  and  414  nonpalpable.  The  patient  ages  ranged  from  17  to  87  years,  with  a  median  age 
of  50  years.  The  same  inclusion  and  exclusion  guidelines  as  described  previously  (33)  applied  to 
this  data  set.  Institutional  review  board  approval  was  obtained  for  this  retrospective  study 
including  a  waiver  of  informed  consent.  Cases  for  analysis  in  this  study  were  selected  from  those 
recommended  for  biopsy  and  were  included  in  the  study  if  the  lesions  corresponded  to  solid 
masses  on  sonography  and  if  both  mammographic  and  sonographic  films  taken  before  the  biopsy 
were  available  for  review. 


Features  Used 


All  patients  underwent  both  mammography  and  sonography.  The  mammographic  exam  consisted 
of  both  craniocaudal  and  mediolateral-oblique  views,  with  additional  true  lateral  and  spot 
compression  magnification  in  almost  all  cases.  Sonographic  images  were  acquired  in  both  radial 
and  antiradial  projections  with  and  without  caliper  measurements.  Additional  gray-scale  images 
were  obtained  in  almost  all  cases  to  better  show  the  lesion.  Doppler,  color  Doppler,  and  power 
Doppler  images  were  not  part  of  the  routine  imaging  protocol  but  were  provided  for  review  when 
available. One of four dedicated breast radiologists with 6-11 years of experience used BI-RADS
lexicon  descriptors  to  describe  the  lesions,  as  described  previously  (33).  Information  about  the 
patient’s  age,  physical  examination  findings,  family  history  of  breast  cancer,  and  personal  history 
of  breast  malignancy  was  available  to  each  radiologist  to  most  accurately  reproduce  a  realistic 
clinical  situation.  The  radiologist  was  blinded  to  the  histologic  diagnosis  during  the  evaluation.  Of 
the  total  37  features,  13  were  mammographic  BI-RADS,  13  were  sonographic  BI-RADS  features, 
4 were ultrasound features suggested by Stavros et al. (19), 4 were other ultrasound features,
and  3  were  patient  history  features.  The  13  mammographic  BI-RADS  features  were  mass  size, 
parenchyma  density,  mass  margin,  mass  shape,  mass  density,  calcification  number  of  particles, 
calcification  distribution,  calcification  description,  architectural  distortion,  associated  findings, 
special  cases,  comparison  with  prior,  and  mass  size.  The  13  sonographic  BI-RADS  features  were 
mass  shape,  mass  orientation,  mass  margin,  posterior  acoustic  features,  radial  diameter, 
antiradial  diameter,  anterior-posterior  diameter,  calcifications  within  mass,  echo  texture,  lesion 
boundary, echo pattern, special cases, and vascularity. The five features suggested by Stavros et
al. were mass shape, mass margin, acoustic transmission, thin echo pseudocapsule, and mass
echogenicity. The four other sonographic mass descriptors were edge shadow, cystic component,
and two mammographic BI-RADS descriptors applied to ultrasound: mass shape and mass
margin. The three patient history features were patient age, family history, and indication for
ultrasound.


In  addition  to  the  BI-RADS  and  Stavros  descriptors,  the  radiologists  also  recorded  their 
assessment  about  the  malignancy  of  the  lesion  as  an  integer  ranging  from  0  for  unquestionably 
benign  to  100  for  unquestionably  malignant.  The  gut  assessment  rating  was  not  used  as  an  input 
to  the  CADx  models,  but  rather  as  a  comparison  to  the  models’  output  for  classification 
performance. 

Predictive  Modeling  and  Sampling 

For  models  in  this  study,  we  used  both  linear  discriminant  analysis  (LDA)  and  artificial  neural 
networks  (ANNs).  The  LDA  was  a  Fisher’s  linear  discriminant.  The  ANNs  were  three-layer  (one 
hidden layer), feed-forward, error back-propagation artificial neural networks. Both methods have
been widely used in previous studies by our group as well as by the rest of the field.
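As a minimal sketch of a Fisher's linear discriminant, the two-feature example below computes the projection direction w = Sw^-1 (m1 - m0) from the class means and the pooled within-class scatter. The toy data, the restriction to two features, and all function names are illustrative assumptions; they are not the 37-feature model used in this study.

```python
# Fisher's linear discriminant for two classes and two features.
# All data and names here are hypothetical, for illustration only.

def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature tuples."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def scatter_matrix(rows, mean):
    """2x2 scatter matrix: sum of outer products of the centered rows."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(benign, malignant):
    """Return w = Sw^-1 (m1 - m0), the direction that best separates the classes."""
    m0, m1 = mean_vector(benign), mean_vector(malignant)
    s0, s1 = scatter_matrix(benign, m0), scatter_matrix(malignant, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def project(w, row):
    """Scalar discriminant output for one case."""
    return w[0] * row[0] + w[1] * row[1]

# Toy feature vectors (hypothetical), one tuple per lesion.
benign = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5), (2.5, 2.0)]
malignant = [(4.0, 5.0), (5.0, 4.5), (4.5, 5.5), (5.5, 5.0)]
w = fisher_direction(benign, malignant)
```

Projecting each case onto w yields a single output value per lesion, to which a decision threshold can then be applied.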

In  order  to  assess  the  usefulness  and  risk  of  using  computer-aided  diagnosis  (CADx)  models  in 
the  clinic,  it  is  crucial  to  have  a  good  estimate  of  their  performance  on  future  cases  (or 
generalization).  For  limited  data  and  more  complicated  models,  the  traditional  method  of  cross- 
validation  could  still  pose  a  danger  of  optimistically  biasing  the  testing  performance;  it  is  common 
to  optimize  certain  global  parameters  (such  as  feature  selection  for  the  LDA  or  the  number  of 
hidden  nodes  of  the  ANN)  to  maximize  cross-validation  performance.  With  cross-validation  the 
scientist  is  able  to  use  knowledge  of  all  the  data  to  make  modeling  decisions,  whereas  with 
generalization  such  information  is  not  available  for  yet  unseen  future  cases.  Therefore  optimizing 
the  models  for  cross-validation  performance  could  lead  to  reduced  generalization  performance. 

In  order  to  avoid  these  overfitting  pitfalls  and  to  better  estimate  generalization  ability  of  each 
model,  we  used  a  train,  validate,  and  retest  scheme.  In  this  scheme  the  data  set  is  divided  into 
two sets: a train/validate set and a retest set. The retest set is held aside until after the models
are finalized, so as not to influence any of the modeling process. All modeling decisions are made
only on the train/validate set. The model parameters are optimized to maximize cross-validation
performance on the train/validate set. Once the model’s parameter values are set, the model is then
trained on the entire train/validate set. The trained model is then applied to the retest set.

In  particular,  for  our  dataset  of  803  lesions,  we  chose  the  first  500  cases  in  chronological  order  for 
the  train/validate  set  and  the  remaining  303  cases  for  the  retest  set.  We  chose  the  ANN’s 
architecture  and  parameter  settings  to  optimize  its  cross-validation  performance  on  the 
train/validate  set.  Once  the  modeling  decisions  had  been  made,  we  trained  the  LDA  and  ANN  on 
all  the  cases  in  the  train/validate  set  to  determine  a  single,  final  set  of  weights,  which  were  then 
applied  to  the  retest  set. 
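The train, validate, and retest scheme can be sketched as follows. This is only an illustration under stated assumptions: a one-feature k-nearest-neighbor classifier stands in for the LDA and ANN, the data are synthetic, 5 folds are used instead of the 100-fold cross-validation reported here, and every function name is hypothetical.

```python
import random

def chronological_split(cases, n_train_validate):
    """First n cases form the train/validate set; the rest are held out for retest."""
    return cases[:n_train_validate], cases[n_train_validate:]

def fit_knn(train_cases, k):
    """'Training' a k-nearest-neighbor stand-in model just stores the cases and k."""
    return (list(train_cases), k)

def predict_knn(model, x):
    """Majority vote among the k training cases closest to x."""
    train_cases, k = model
    nearest = sorted(train_cases, key=lambda case: abs(case[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def cross_validated_accuracy(cases, n_folds, k):
    """Accuracy over held-out folds, using only the cases passed in."""
    correct = 0
    for fold in range(n_folds):
        held_out = cases[fold::n_folds]
        training = [c for i, c in enumerate(cases) if i % n_folds != fold]
        model = fit_knn(training, k)
        correct += sum(predict_knn(model, x) == y for x, y in held_out)
    return correct / len(cases)

# Synthetic cases in "chronological" order: (feature value, label).
rng = random.Random(0)
cases = []
for _ in range(200):
    label = rng.randint(0, 1)
    cases.append((rng.gauss(2.0 * label, 1.0), label))

# 1. Hold the retest set aside before any modeling decision is made.
train_validate, retest = chronological_split(cases, 125)

# 2. Choose the model setting by cross-validation on the train/validate set only.
best_k = max((1, 3, 5, 7),
             key=lambda k: cross_validated_accuracy(train_validate, 5, k))

# 3. Retrain on the entire train/validate set, then apply once to the retest set.
final_model = fit_knn(train_validate, best_k)
retest_accuracy = sum(predict_knn(final_model, x) == y
                      for x, y in retest) / len(retest)
```

The key design point is that `retest` is never touched until step 3, so the retest result is not biased by the search over `k`.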

Classifier  Performance  Evaluation 

To use the LDA or ANN model as a diagnostic aid, one could select a threshold value, so that
cases  with  output  values  below  the  threshold  would  be  considered  very  likely  benign  and 
therefore  candidates  for  follow-up  rather  than  biopsy.  Those  cases  with  model  outputs  greater 
than  the  threshold  would  be  considered  suspicious  for  malignancy  and  recommended  for  biopsy. 
Varying  the  threshold  value  results  in  a  tradeoff  between  sensitivity  and  specificity.  The  entire 
range  of  sensitivity  and  specificity  values  for  a  classifier  is  illustrated  by  the  receiver  operating 
characteristic  (ROC)  curve  (34,  35).  In  order  to  quantify  a  classifier’s  performance,  we  used  the 
following five summary measures of the ROC curve: the area under the ROC curve (AUC), the partial
area (0.90AUC), and the specificity, positive predictive value (PPV), and negative predictive
value (NPV) at a given sensitivity level. The AUC represents the average specificity over all
sensitivities and ranges from 0.5 (chance performance) to 1.0 (perfect performance). Since high
sensitivity is essential for this classification task, a more relevant performance measure is the
0.90AUC, which represents the average specificity of the classifier at sensitivities from
90% to 100%. Whereas the two previous measures provide an overall summary of performance,
the remaining three are clinically relevant measures corresponding to a single threshold value,
which for breast cancer applications is usually chosen to deliver nearly perfect sensitivity, such as
98% (36, 37).
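The five summary measures can be illustrated with a short sketch. It assumes an empirical (piecewise-linear) ROC curve and computes both area measures as the average specificity over a range of sensitivities, as defined above; the toy scores and all function names are hypothetical, and the study's own ROC software may differ (for example, by fitting a smooth curve).

```python
def roc_points(scores, labels):
    """(FPF, TPF) points obtained by thresholding at every distinct score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def mean_specificity(points, tpf_min=0.0):
    """Average specificity over sensitivities from tpf_min to 1.

    tpf_min = 0.0 gives the AUC; tpf_min = 0.90 gives the partial area
    above 90% sensitivity, normalized to the range 0 to 1.
    """
    area = 0.0
    for (f1, t1), (f2, t2) in zip(points, points[1:]):
        if t2 <= tpf_min or t2 == t1:
            continue  # segment lies below the cutoff or adds no sensitivity
        if t1 < tpf_min:  # clip the segment at the sensitivity cutoff
            f1 = f1 + (f2 - f1) * (tpf_min - t1) / (t2 - t1)
            t1 = tpf_min
        area += (t2 - t1) * ((1 - f1) + (1 - f2)) / 2
    return area / (1.0 - tpf_min)

def operating_point(scores, labels, target_sensitivity=0.98):
    """Highest threshold reaching the target sensitivity, with its metrics."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        if tp / n_pos >= target_sensitivity:
            fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
            tn, fn = n_neg - fp, n_pos - tp
            return {"threshold": t,
                    "sensitivity": tp / n_pos,
                    "specificity": tn / n_neg,
                    "ppv": tp / (tp + fp),
                    "npv": tn / (tn + fn) if tn + fn else float("nan")}

# Toy model outputs (hypothetical); label 1 = malignant, 0 = benign.
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
pts = roc_points(scores, labels)
auc_value = mean_specificity(pts)           # 0.88 for this toy data
partial_auc = mean_specificity(pts, 0.90)   # 0.60 for this toy data
op = operating_point(scores, labels, 0.98)
```

On this toy data the operating point lands at threshold 0.35, where all five malignant cases are caught and three of the five benign cases are spared.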

Results 

Generalization  between  Validating  and  Retesting 

Table  1  shows  the  LDA  performances  with  both  100-fold  cross-validation  on  the  train/validate  set 
and  retest  performance  on  the  retest  set.  The  LDA  achieved  high  classification  performance,  with 
AUC = 0.92 ± 0.01 and 0.90AUC = 0.54 ± 0.08 on the validate set and AUC = 0.92 ± 0.02 and
0.90AUC = 0.52 ± 0.08 on the retest set. The LDA generalized well; there were no statistically
significant  differences  between  the  performance  metrics  of  the  validate  set  and  those  of  the  retest 
set  (p  >  0.10).  In  addition  to  the  entire  ROC  curves  of  the  LDA  performance,  individual  thresholds 
also  generalized  very  well.  Table  2  shows  that  the  same  threshold  value  determined  very  similar 
true-positive fraction (sensitivity) and false-positive fraction (1 - specificity) operating points in the
high-sensitivity  region  on  both  ROC  curves. 

The ANN also performed very well, achieving AUC = 0.92 ± 0.01 and 0.90AUC = 0.55 ± 0.08 on the
validate set and AUC = 0.91 ± 0.02 and 0.90AUC = 0.57 ± 0.06 on the retest set. The ANN
performed  comparably  on  the  validate  and  retest  set,  with  no  significant  differences  in  either 
metric  (p  >  0.10). 

Comparison  of  LDA  and  ANN  Performances 

The  two  types  of  models,  LDA  and  ANN,  had  very  similar  performances  on  both  the  validation 
and  retest  sets;  the  differences  were  not  statistically  significant  (p  >  0.10).  In  the  interest  of 
brevity, the ANN performance tables are not shown because they show trends very similar to those
of the LDA performance tables.


Figure  1  depicts  the  four  models’  good  generalization  performance  graphically.  The  ROC  curves 
for the LDA and ANN in both testing paradigms appear in Figure 1. Figure 1(a) shows the entire
ROC curves, while Figure 1(b) shows only the high-sensitivity region (TPF > 0.90) of those curves. The
discrepancies  among  the  curves  were  very  minor,  and  the  curves  overlap  each  other.  The 
similarity  of  the  ROC  curves  showed  that  all  four  had  essentially  indistinguishable  classification 
performance.  Figure  1  shows  good  evidence  of  generalization  for  the  LDA  and  ANN  because 
there  was  no  performance  drop  from  the  validation  curves  to  the  retest  curves. 

Feature  Selection  and  Generalization  of  Simplified  Model 

For the LDA, we also performed a stepwise feature selection, which chose the following 14
features: patient age, calcification distribution, calcification description, associated findings,
comparison  with  prior,  anterior-posterior  diameter,  indication  for  ultrasound,  Stavros  mass  shape, 
BI-RADS  mass  margin,  edge  shadow,  cystic  component,  ultrasound  lesion  boundary,  surrounding 
tissue  effects,  and  ultrasound  special  findings.  Feature  selection  was  done  using  the  validate  set 
only. On the retest set, an LDA using only these stepwise-selected features performed
comparably, with no significant difference compared to the LDA using all the features (AUC =
0.92 ± 0.02 vs. 0.91 ± 0.02, p > 0.3). The full performance table for the LDA with the
stepwise-selected features is not shown due to its close similarity to the table of the fully featured LDA.

Comparing  the  LDA  to  the  Radiologists’  Assessment  of  Malignancy 

Table  3  compares  the  retest  performance  of  the  LDA  against  the  radiologists’  assessment  rating 
on the retest set. Like the LDA, the radiologists’ gut assessment also achieved high classification
performance on the retest set, with AUC = 0.92 ± 0.02 and 0.90AUC = 0.52 ± 0.06.
There were no statistically significant differences in any of the performance metrics of the
LDA and radiologists’ overall gut assessment (p > 0.2). For example, on this retest data set the
LDA and radiologists performed with very similar NPV values (97 ± 1% versus 98 ± 1%, p = 0.25).


Figure  2  shows  the  ROC  curves  for  the  LDA  with  all  features,  the  LDA  with  the  stepwise-selected 
features,  and  the  radiologists’  assessment  of  malignancy.  There  were  no  statistically  significant 
differences  in  any  of  the  performance  metrics  among  the  three  ROC  curves  (p  >  0.2).  Although 
the  radiologist  curve  crossed  over  the  LDA  curves  several  times,  even  at  the  points  of  greater 
divergence,  the  differences  were  not  statistically  significant  (p  >  0.2). 

Figure  3  depicts  the  histograms  of  the  LDA  output  (Fig.  3  a)  and  radiologists’  gut  assessment 
(Fig.  3  b)  values  for  the  retest  set.  The  histograms  show  the  distinction  in  the  output  distributions 
between  the  benign  and  malignant  lesions.  The  values  for  the  benign  lesions  tended  to  fall  on  the 
left  of  the  histogram  plot  with  values  around  zero.  Those  for  the  malignant  lesions  were 
concentrated  on  the  right  of  the  plots,  around  one  for  the  LDA  and  100  for  the  radiologists’ 
assessment  values.  There  were  few  values  in  the  center  regions,  compared  to  those  on  the 
extremes. 

Example  patient  cases  are  presented  in  Figures  4  through  6  to  illustrate  situations  where 
radiologists  and  computer  models  agree  as  well  as  disagree.  Shown  in  Figure  4,  Patient  1 
presented with a well-defined, oval, well-circumscribed mass, which indicated a benign lesion.
The histopathology result indicated fibroadenoma. Both the LDA and radiologist considered this
case very benign, giving scores of 0.02/1.00 and 0/100, respectively. Shown in Figure 5, Patient 2
presented  with  a  mass  with  irregular  shape,  indistinct  margin,  and  shadowing  with  echogenic 
tails.  Histopathologic  diagnosis  indicated  that  this  lesion  was  invasive  ductal  carcinoma.  Both  the 
LDA and radiologist considered this case very malignant, with scores of 0.99/1.00 and 95/100,
respectively.  Shown  in  Figure  6,  Patient  3  presented  with  a  mass  with  an  ill-defined  margin  in  the 
mammogram.  In  the  ultrasound  image  the  lesion  appeared  circumscribed  and  oval  with  thick 
margins.  Histopathologic  diagnosis  indicated  that  this  lesion  was  necrotic  breast  tissue.  Although 
some  necroses  could  indicate  malignancy,  follow-up  exams  have  shown  that  cancer  has  not 
appeared in this patient since biopsy two years ago. The LDA considered this case relatively
benign with a score of 0.33/1.00, whereas the radiologist considered it more indicative of
malignancy with a score of 85/100.

Discussion 

Previous  studies  have  shown  that  BI-RADS  descriptors  for  both  mammography  (4,  38-41)  and 
sonography  (19,  20,  42)  are  useful  in  predicting  the  likelihood  of  breast  cancer.  A  previous  study 
(33)  showed  that  mammographic  and  sonographic  BI-RADS  features  as  well  as  Stavros 
ultrasound  features  (19)  could  differentiate  malignant  from  benign  breast  masses  with  high 
statistical  significance.  Both  mammographic  (43,  44),  and  sonographic  (25,  27,  45)  features  have 
been  useful  in  breast  cancer  computer-aided  diagnosis  (CADx)  systems  as  well.  Whereas 
previous  studies  have  used  other  features  extracted  from  the  sonogram  image,  to  the  best  of  our 
knowledge  this  current  study  is  the  first  CADx  study  not  only  to  use  sonographic  BI-RADS 
features  but  also  the  first  to  combine  BI-RADS  of  ultrasound  and  of  mammography. 

In  order  to  justify  the  clinical  use  of  a  CADx  system  on  new  cases,  it  is  important  to  estimate  its 
generalization performance. We have estimated the generalization performance of both an LDA
and an ANN on our data set by using a train-validate-retest testing scheme. This
is  a  more  rigorous  standard  than  most  studies  that  rely  upon  train-validate  only,  also  known  as 
cross-validation. 

The  LDA  and  ANN  had  virtually  indistinguishable  classification  performance,  which  indicated  that 
the  BI-RADS  data  were  highly  linear.  In  general,  such  results  would  support  the  use  of  the  LDA 
model,  which  is  simpler  than  the  nonlinear  ANN  and  therefore  less  likely  to  be  susceptible  to 
overtraining  problems.  In  this  study,  however,  it  was  demonstrated  that  there  were  no  problems 
with  overtraining,  as  both  models  performed  very  similarly  during  the  retesting  phase. 


In  addition  to  the  whole  ROC  curve,  it  is  important  to  consider  more  clinically  relevant  threshold 
values  in  determining  the  generalization  and  stability  of  a  CADx  system.  Since  CADx  systems 
typically  give  as  output  a  range  of  values,  applying  a  certain  threshold  to  the  output  determines 
the  operating  point  (sensitivity  and  specificity  settings)  at  which  the  clinical  decision  is  made. 
Knowing  the  CADx  operating  point  helps  the  clinician  to  incorporate  it  into  an  overall  diagnostic 
decision.  Table  2  showed  that  the  LDA  thresholds  from  the  validation  ROC  curve  generalized 
very  well  to  the  retest  ROC  curve  in  the  clinically  important  high-sensitivity  region.  The  threshold 
stability  suggests  that  these  threshold  values  could  be  used  clinically  with  the  LDA  on  future 
cases. 
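Threshold stability of the kind reported in Table 2 can be checked by applying the same fixed thresholds to two sets of model outputs and comparing the resulting operating points. In the sketch below the threshold values are the ones listed in Table 2, but the scores and labels are small made-up sets and the function name is hypothetical, so the printed fractions do not reproduce the table.

```python
def operating_point_at(scores, labels, threshold):
    """Return (TPF, FPF) when outputs above the threshold are called suspicious."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    return tp / n_pos, fp / n_neg

# Hypothetical model outputs; label 1 = malignant, 0 = benign.
validate_scores = [0.01, 0.03, 0.05, 0.20, 0.60, 0.04, 0.30, 0.90, 0.95]
validate_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]
retest_scores = [0.02, 0.04, 0.10, 0.50, 0.06, 0.25, 0.85, 0.99]
retest_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Apply each fixed threshold, unchanged, to both sets.
for threshold in (0.0782, 0.0373, 0.0201, 0.0098):
    tpf_v, fpf_v = operating_point_at(validate_scores, validate_labels, threshold)
    tpf_r, fpf_r = operating_point_at(retest_scores, retest_labels, threshold)
    print(f"{threshold:.4f}  validate TPF={tpf_v:.3f} FPF={fpf_v:.3f}  "
          f"retest TPF={tpf_r:.3f} FPF={fpf_r:.3f}")
```

A threshold generalizes well when the two (TPF, FPF) pairs in each row are close to one another.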

Because  the  task  of  collecting  many  features  is  quite  cumbersome  for  the  radiologists  involved, 
we  investigated  CADx  performance  using  only  a  subset  of  the  features  by  performing  stepwise 
feature  selection.  Of  the  14  selected  features,  three  had  also  been  found  to  have  high  malignancy 
predictive  value  from  a  previous  study  (33):  Stavros  mass  shape,  mammographic  mass  margin, 
and  sonographic  lesion  boundary.  To  assure  that  the  selected  features  were  adequate  to  allow 
the CADx system to generalize well on new cases, a train-validate-retest scheme was required. Only
the  train/validate  set  was  used  to  select  the  features,  which  were  then  tested  in  a  CADx  model  on 
the  retest  set.  As  shown  in  Figure  2,  an  LDA  with  only  the  14  stepwise-selected  features 
performed  just  as  well  as  an  LDA  with  all  37  features.  The  small  number  of  features  required  for 
good  performance  suggests  that  this  CADx  model  may  be  able  to  offer  the  benefit  of  a  second 
reader  to  a  clinician  without  greatly  slowing  the  clinician’s  workflow. 

Figure  2  also  shows  that  the  LDA  distinguished  benign  from  malignant  lesions  no  differently  than 
did  the  radiologists’  assessment  scores  for  our  data  set.  Note  that  for  this  data  set,  the  actual 
positive  predictive  value  of  the  clinical  decision  to  refer  to  biopsy  was  37%,  which  is  typical  of  this 
institution.  Also,  since  this  study  included  only  biopsy-verified  cases,  over  this  special  population 
the sensitivity for cancer detection is by definition 100% and the specificity is 0%. The results of
this study suggest that the radiologists may be able to achieve considerable improvements in
performance, such as 52% specificity, 60% PPV, and 98% NPV, by adjusting their mental
threshold to reduce their sensitivity slightly to 98%, i.e., accepting the delayed
diagnosis of 2% of actual cancers, which may be identified by interval change at a short-term
follow-up diagnostic study. Likewise, if the radiologists were hypothetically to adopt all the
recommendations  of  the  computer  model,  they  could  have  perhaps  attained  37%  specificity,  53% 
PPV,  and  97%  NPV  at  that  same  98%  sensitivity  level. 
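The relationship among sensitivity, specificity, prevalence, and the predictive values follows from Bayes' rule, and a small worked example may make the figures above easier to interpret. The sketch assumes a malignancy prevalence of 37%, taken from the biopsy PPV quoted above for the whole data set; the prevalence within the retest set is not stated here, so the outputs are not expected to match the reported PPV values exactly. The function names are hypothetical.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value: P(malignant | called suspicious)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

def npv(sensitivity, specificity, prevalence):
    """Negative predictive value: P(benign | called benign)."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)

PREVALENCE = 0.37  # assumed: fraction of biopsied lesions that were malignant

# Current practice on this population: every case is biopsied, so
# sensitivity is 100%, specificity is 0%, and the PPV equals the prevalence.
ppv_biopsy_all = ppv(1.00, 0.00, PREVALENCE)

# Operating point at 98% sensitivity and 52% specificity.
ppv_at_98 = ppv(0.98, 0.52, PREVALENCE)   # about 0.55 under the assumed prevalence
npv_at_98 = npv(0.98, 0.52, PREVALENCE)   # about 0.98 under the assumed prevalence
```

The example shows why a modest gain in specificity at nearly unchanged sensitivity raises the PPV well above the 37% baseline while keeping the NPV high.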

The  radiologists  in  this  study  were  experienced  dedicated  breast  imagers.  It  is  hoped  that  less 
specialized  radiologists  using  such  a  system  could  improve  their  diagnostic  performance  closer  to 
that  of  breast  specialists.  In  practice,  it  remains  to  be  determined  how  radiologists  would  use  the 
results  from  such  computer  models,  in  particular  whether  they  would  modify  their  biopsy 
recommendation  in  order  to  refer  to  short-term  follow-up  those  cases  deemed  to  be  very  likely 
benign.  It  also  remains  unknown  whether  the  2%  of  cancers  mistakenly  referred  to  follow  up 
would  prove  to  remain  early  stage  such  as  with  the  current  clinical  practice  of  following  probably 
benign  cases. 

As described in a previous study using this data set (33), this study’s weaknesses with the BI-RADS
data collection included the possibility of multiple lesions per patient, the limitation to solid
masses  rather  than  cysts,  and  the  inclusion  of  only  biopsy-proven  lesions  in  the  study. 
Additionally,  radiologists  allowed  the  mammogram  to  influence  their  recording  of  the  sonographic 
features,  because  they  analyzed  the  mammogram  immediately  before  the  sonogram.  The  study 
was  organized  in  this  manner  to  better  reflect  actual  clinical  practice  in  which  the  mammogram  is 
obtained immediately prior to the sonogram and decisions are made using all available data. The radiologists
also  could  have  shifted  their  diagnostic  sensitivity  and  specificity  levels  from  their  usual  clinical 
levels  because  they  were  aware  that  the  cases  had  been  resolved  and  therefore  their 
assessment  ratings  did  not  directly  affect  patient  care. 


In  conclusion,  the  models’  good  classification  and  generalization  performance  on  our  data  set 
suggest  that  the  models  could  be  used  as  a  computer-aided  diagnosis  (CADx)  system  for  future 
mass  lesions.  Since  the  LDA  threshold  values  generalized  well,  the  desired  operating  point  on  the 
ROC  curve  could  be  set  for  future  cases,  increasing  the  usefulness  of  the  CADx  system.  Because 
the  stepwise-selected  features  were  adequate  for  good  classification  and  generalization,  they 
could  be  used  in  a  CADx  system  that  would  require  only  minimal  feature  collection  and  burden  on 
the clinician’s workflow. In this study we were not trying to improve the diagnostic accuracy of
dedicated breast imagers; rather, we hope to offer a tool to radiologists in other
specialties that will allow a substantial decrease in the number of unnecessary benign breast
biopsies while minimizing the number of delayed breast cancer diagnoses.
Tables


Table 1: Generalization of the LDA ROC Curve

| Performance measure | Cross validation on train/validate set | Retest on retest set | p-value for difference in means |
|---|---|---|---|
| AUC | 0.92 ± 0.01 | 0.92 ± 0.02 | 0.81 |
| 0.90AUC | 0.54 ± 0.08 | 0.52 ± 0.08 | 0.87 |
| Spec at 98% sens | 0.34 ± 0.13 | 0.37 ± 0.10 | 0.89 |
| PPV at 98% sens | 0.44 ± 0.06 | 0.53 ± 0.05 | 0.21 |
| NPV at 98% sens | 0.97 ± 0.02 | 0.97 ± 0.01 | 0.92 |

Caption: Table 1 shows the LDA’s generalization by comparing the LDA’s classification
performance for 100-fold cross validation on the train/validate set (the original 500 cases) to the
performance on the retest set (the latest 303 cases). The first column contains the various ROC
performance metrics, whereas the LDA’s scores on these metrics appear in column 2 for the
train/validate set and in column 3 for the retest set. The values are shown as the mean plus or
minus one standard deviation, as determined by bootstrap analysis of the ROC curves. The last
column shows the two-tailed p-values for the difference in the two sets’ performance metric
values, as determined by two-sided t-tests. The LDA achieved high classification performance.
Since there were no statistically significant differences between the performance metrics, the LDA
performed equivalently when cross validating on the train/validate set and when retesting on the
retest set: the LDA generalized well.
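One plausible way to obtain a mean and standard deviation for an ROC metric by bootstrap is to resample the cases with replacement and recompute the metric on each resample. The sketch below does this for the AUC; it is an assumption about the general approach rather than a description of the exact bootstrap procedure used for this table, and the synthetic scores and function names are hypothetical.

```python
import random
import statistics

def auc(scores, labels):
    """AUC as the fraction of (malignant, benign) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc(scores, labels, n_resamples=500, seed=0):
    """Mean and standard deviation of the AUC over case-level bootstrap resamples."""
    rng = random.Random(seed)
    n = len(scores)
    estimates = []
    while len(estimates) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_labels = [labels[i] for i in idx]
        # Skip the rare resample that contains only one class.
        if 0 < sum(resampled_labels) < n:
            estimates.append(auc([scores[i] for i in idx], resampled_labels))
    return statistics.mean(estimates), statistics.stdev(estimates)

# Synthetic model outputs: 30 benign and 30 malignant cases.
data_rng = random.Random(1)
labels = [0] * 30 + [1] * 30
scores = [data_rng.gauss(2.0 * y, 1.0) for y in labels]
mean_auc, sd_auc = bootstrap_auc(scores, labels)
```

The same resampling loop can be reused for the partial area or for the specificity at a fixed sensitivity by swapping the metric function.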


Table 2: Generalization of the LDA Threshold

| LDA Threshold | TPF on Validate ROC | TPF on Retest ROC | FPF on Validate ROC | FPF on Retest ROC |
|---|---|---|---|---|
| 0.0782 | 0.953 | 0.945 | 0.429 | 0.466 |
| 0.0373 | 0.976 | 0.976 | 0.613 | 0.591 |
| 0.0201 | 0.982 | 0.984 | 0.728 | 0.739 |
| 0.0098 | 1 | 1 | 0.879 | 0.886 |

Caption:  The  LDA  thresholds  from  the  validation  ROC  curve  generalized  very  well  to  the  retest 
ROC  curve.  The  same  threshold  value  determined  very  similar  true-positive  fraction  (sensitivity) 
and false-positive fraction (1 - specificity) operating points on both ROC curves. Such performance
stability  is  clinically  important  for  computer-aided  diagnosis  (CADx)  systems;  knowing  the  CADx 
operating  point  helps  the  clinician  to  incorporate  it  into  an  overall  diagnostic  decision. 


Table 3: LDA vs. Radiologists’ Overall Gut Assessment on the Retest Set

| Performance measure | LDA | Radiologists’ overall gut assessment | p-value for difference in means |
|---|---|---|---|
| AUC | 0.92 ± 0.02 | 0.92 ± 0.02 | 0.98 |
| 0.90AUC | 0.52 ± 0.08 | 0.52 ± 0.06 | 0.98 |
| Spec at 98% sens | 0.37 ± 0.10 | 0.52 ± 0.08 | 0.25 |
| PPV at 98% sens | 0.53 ± 0.05 | 0.60 ± 0.05 | 0.25 |
| NPV at 98% sens | 0.97 ± 0.01 | 0.98 ± 0.01 | 0.25 |

Caption:  The  table  compares  the  LDA  to  the  radiologists’  overall  gut  assessment  on  the  retest 
set.  Column  1  lists  the  ROC  performance  metrics,  column  2  the  LDA’s  performance,  column  3  the 
radiologists’  performance,  and  column  4  the  two-tailed  p-value  for  the  difference  in  means.  The  p- 
values  and  errors  on  the  classification  performance  metric  values  were  determined  by  ROC 
bootstrap  analysis.  Both  the  LDA  and  the  radiologists  achieved  excellent  classification 
performance  and  performed  equivalently,  with  no  statistically  significant  performance  differences 
between  them. 


Figures 

Figure 1a: Full ROC Curves: Validation vs. Retest

Figure 1b: Partial ROC Curves: Cross Validation vs. Retest


Figure  1  Caption:  Both  the  LDA  and  the  ANN  generalized  well  on  the  retest  data  set,  as  shown  by 
their  overlapping  ROC  curves.  The  validation  ROC  curves  (solid  curves)  lie  very  close  to  the 
retest ROC curves (dashed curves). The LDA and ANN had virtually indistinguishable
classification performances.


Figure 2a: Full ROC Curves: LDA vs. Radiologist, Retest Set

Figure 2b: Partial ROC Curves: LDA vs. Radiologist, Retest Set


Figure  2  Caption:  Shown  here  are  the  ROC  curves  for  the  LDA  with  all  features,  for  the  LDA  with 
the  stepwise-selected  features,  and  for  the  radiologists’  assessment  of  malignancy.  In  retesting, 
the  LDA,  both  using  all  features  and  using  the  stepwise-selected  features,  performed  very 
similarly  to  the  radiologists’  overall  gut  assessment  scoring.  There  were  no  statistically  significant 
differences  in  any  of  the  performance  metrics  among  the  three  ROC  curves  (p  >  0.2).  Although 
the  radiologist  curve  crossed  over  the  LDA  curves  several  times,  even  at  the  points  of  greater 
divergence,  the  differences  were  not  statistically  significant  (p  >  0.2). 


Figure 3a: Histograms of the LDA Output Values

Figure 3b: Histograms of the Radiologists’ Overall Gut Assessment

Figure 3 Caption: Plotted above are histograms of the LDA output values (a) and of the
radiologists’  overall  gut  assessment  values  (b).  The  histogram  counts  for  the  truly  benign  lesions 
are  shown  in  gray,  and  those  for  the  truly  malignant  lesions  are  shown  in  black.  For  classification, 
a  threshold  would  be  applied  to  the  LDA  output,  so  that  output  values  below  the  threshold  would 
be  designated  benign  and  those  above  it  would  be  designated  malignant. 


Figure 4a: Mammogram of Patient 1

Figure 4a Caption: Mediolateral oblique mammographic view in a 52-year-old woman demonstrates
an oval, well-circumscribed, equal density mass in the superior left breast.

Figure 4b: Sonogram of Patient 1

Figure 4b Caption: Ultrasound views of the mass demonstrate an oval, hypoechoic solid mass
with circumscribed margins, parallel orientation, and posterior acoustic shadowing. The
histopathology result indicated a benign fibroadenoma. Both the LDA and radiologist correctly
considered this case very benign, giving scores of 0.02/1.00 and 0/100, respectively.


Figure 5a: Mammogram of Patient 2

Figure 5a Caption: Mediolateral oblique mammographic view in a 57-year-old woman
demonstrates an ill-defined, irregularly-shaped, equal density mass in the superior right
breast.

Figure 5b: Sonogram of Patient 2

Figure 5b Caption: US views of the mass demonstrate an ill-defined, irregularly-shaped mass with
posterior acoustic shadowing and not-parallel orientation. Histopathologic diagnosis indicated that
this malignant lesion was invasive ductal carcinoma. Both the LDA and radiologist correctly
considered this case very malignant, with scores of 0.99/1.00 and 95/100, respectively.


Figure 6a: Mammogram of Patient 3

Figure 6a Caption: Mediolateral oblique mammographic view in a 26-year-old woman demonstrates
an ill-defined, oval-shaped, equal density mass in the posterior left breast.

Figure 6b: Sonogram of Patient 3

Figure 6b Caption: Sonographic views of the mass demonstrate an oval, circumscribed mass with
parallel orientation and no posterior acoustic features. Histopathologic diagnosis indicated that
this lesion was necrotic breast tissue. Follow-up exams confirm no interval change two years post
biopsy. The LDA considered this case relatively benign with a score of 0.33/1.00, whereas the
radiologist considered it more indicative of malignancy with a score of 85/100.
ABSTRACT

Data  sets  with  relatively  few  observations  (cases)  in  medical  research  are  common,  especially  if  the  data  are 
expensive  or  difficult  to  collect.  Such  small  sample  sizes  usually  do  not  provide  enough  information  for  computer 
models to learn data patterns well enough for good prediction and generalization. We used decision fusion as a model
that may be able to maintain good classification performance in the presence of limited data. In this study, we
investigated the effect of sample size on the generalization ability of both linear discriminant analysis (LDA) and
decision fusion. Subsets of large data sets were selected by a bootstrap sampling method, which allowed us to
estimate the mean and standard deviation of the classification performance as a function of data set size. We applied
the models to two breast cancer data sets and compared the models using receiver operating characteristic (ROC)
analysis. For the more challenging calcification data set, decision fusion reached its maximum classification
performance of AUC = 0.80±0.04 at 50 samples and pAUC = 0.34±0.05 at 100 samples. The LDA reached a lower
performance and required many more cases, with a maximum of AUC = 0.68±0.04 and pAUC = 0.12±0.05 at 450
samples. For the mass data set, the two classifiers had more similar performance, with AUC = 0.92±0.02 and pAUC
= 0.48±0.02 at 50 samples for decision fusion and AUC = 0.92±0.03 and pAUC = 0.55±0.04 at 500 samples for the
LDA.

Keywords:  Decision  Fusion,  Computer-Aided  Diagnosis,  Sample  Size,  Receiver  Operating  Characteristic  (ROC) 
Curve,  Classification,  Breast  Cancer 


I. INTRODUCTION

Many  medical  data  sets  are  difficult  and  expensive  to  collect,  often  resulting  in  limited  data  set  size.  A  small  number 
of  cases  usually  precludes  accurate  predictive  modeling.  Early  modeling  offers  many  advantages,  such  as  earlier 
identification  of  data  collection  problems,  of  unsatisfactory  patient  sampling,  of  expensive  but  uninformative 
features,  and  perhaps  earlier  discovery  of  flaws  in  the  scientific  experiment  design.  Many  medical  experiments 
expose  subjects  to  possibly  avoidable  risk  that  could  be  detected  by  better  and  earlier  modeling. 

The  amount  of  available  data  affects  each  model  differently.  Model  complexity  tends  to  produce  a  tradeoff  between 
modeling  power  and  generalization;  simpler  models  may  be  more  robust  to  noise  in  the  data  but  may  not  be  able  to 
capture  the  full  complexity  of  the  data’s  patterns,  whereas  more  complicated  models  may  model  the  patterns  better 
but  are  more  susceptible  to  overfitting.  In  addition  to  the  number  of  samples  available,  the  ratio  of  number  of 
features  to  number  of  samples  can  also  affect  classifier  performance.  Many  classical  models  tend  to  overtrain  on 
data  sets  with  few  samples  and  many  features.  This  overtraining  effect  becomes  more  pronounced  with  smaller 
sample  size. 

In  this  study,  we  investigated  the  effect  of  sample  size  on  the  generalization  ability  of  two  computer-aided  diagnosis 
(CADx)  models.  The  first  model  was  linear  discriminant  analysis  (LDA),  a  common  CADx  model  for  breast  cancer 
data.  The  second  model  was  a  decision-fusion  method  that  has  shown  promise  for  small,  noisy  data  sets*.  Our 
decision-fusion technique offers the significant advantage that it can reduce the dimensionality of the feature space
of the classification problem by assigning a classifier to each feature separately. Considering only one feature at a
time  greatly  reduces  the  complexity  of  the  problem  by  avoiding  the  need  to  estimate  multidimensional  probability 
density  functions  (PDFs)  of  the  feature  space.  Accurately  estimating  multidimensional  PDFs  likely  requires  many 
more  observations  than  a  typical  medical  data  set  contains^.  Considering  only  one-dimensional  PDFs  may  allow  the 
decision-fusion  technique  to  reach  asymptotic  testing  performance  using  many  fewer  cases  than  other  classifiers 
require. 

Other  benefits  of  decision  fusion  are  that  it  is  robust  in  noisy  data^,  is  not  overly  sensitive  to  the  likelihood  ratio 
threshold  values"*,  and  can  handle  missing  data  values^.  Our  decision-fusion  technique  can  also  be  tuned  to  optimize 
arbitrary  performance  metrics  that  may  be  more  clinically  relevant,  unlike  more  traditional  classification  algorithms 
that  optimize  mean  squared  error,  such  as  the  LDA. 


II.  METHODS 


2.1  Data 

This study used two breast cancer data sets: one of mass lesions and one of calcification lesions.

The  mass  lesion  data  set  is  an  extension  of  the  earlier  subset  described  by  Hong,  et  al.  from  this  research  group®.  The 
cases  were  collected  between  2000  and  2005  at  Duke  University.  The  data  set  included  803  lesions,  of  which  296 
were  malignant  and  507  were  benign,  and  389  were  palpable  and  414  nonpalpable.  The  patient  ages  ranged  from  17 
to  87  years,  with  a  median  age  of  50  years.  Patients  underwent  both  mammography  and  sonography,  and  outcome 
was  determined  through  definitive  histopathological  diagnosis.  One  of  three  dedicated  breast  radiologists  with  6-11 
years  of  experience  described  each  lesion  using  Breast  Imaging  Reporting  and  Data  System  (BI-RADS™,  American 
College  of  Radiology,  Reston,  VA)^  mammography,  BI-RADS  sonography,  and  Stavros  sonography  descriptors®. 

Of  the  total  38  features,  13  were  mammographic,  22  were  sonographic,  and  3  were  patient  history  features. 

Second,  we  used  a  calcification  data  set  that  consisted  of  1508  mammogram  microcalcification  lesions  from  the 
Digital  Database  for  Screening  Mammography  (DDSM)*,  which  is  publicly  available.  The  outcomes  were  verified 
by  histopathological  diagnosis  and  follow-up  for  certain  benign  cases,  yielding  811  benign  and  697  malignant 
calcification  lesions.  The  feature  groups  were  13  computer-extracted  calcification  cluster  morphological  features,  91 
computer-extracted  texture  features  of  the  lesion  background  anatomy,  2  radiologist-interpreted  findings,  2 
radiologist-extracted features from the BI-RADS lexicon, and patient age. In total, the calcification data set had 109
features and a sample-to-feature ratio of approximately 14:1. Each mammogram was digitized with a resolution of
either  43.5  microns  (Howtek  960  or  MultiRad850  digitizer)  or  50  microns  (Lumisys  200  Laser  digitizer).  We  used  a 
512x512  pixel  ROI  centered  on  the  centroid  of  each  lesion  (using  lesion  outlines  drawn  by  the  DDSM  radiologists) 
for  image  processing  and  for  generating  the  computer-extracted  features.  We  extracted  morphological  and  texture 
(spatial  gray  level  dependence  matrix)  features,  which  were  shown  to  be  useful  in  previous  studies  of  CADx  such  as 
by  Chan,  et  al.^. 

2.1  Decision  Fusion 

For  the  decision-fusion  classifier,  histograms  of  each  feature  were  constructed  as  an  estimate  of  the  probability 
density  in  order  to  construct  an  empirical  likelihood  ratio  for  that  feature.  Then,  a  binary  decision  was  made  by 
comparing  the  likelihood  ratio  value  to  a  given  threshold,  which  in  turn  determined  the  sensitivity  and  specificity  of 
the  decision.  Finally,  the  decision  fusion  theory  allowed  the  individual  binary  decisions  to  be  combined  optimally  to 
produce  one  final  binary  decision. 

First,  each  feature  was  considered  separately  and  classified  by  a  likelihood  ratio  classifier.  According  to  decision 
theory,  the  likelihood  ratio  is  the  optimal  detector  to  determine  the  presence  or  absence  of  a  signal  in  noise*®.  The 
null hypothesis ($H_0$) was that the signal is not present in the noisy features, while the alternative hypothesis ($H_1$) was
that the signal is present.

$$
\begin{aligned}
H_0 &: X = N \\
H_1 &: X = S + N
\end{aligned}
\tag{1}
$$

The likelihood ratio is the probability of the features under the malignant case divided by the probability of the
features under the benign case:

$$
\lambda(X) = \frac{p(X \mid H_1)}{p(X \mid H_0)}
\tag{2}
$$

where $p(X \mid H_1)$ is the PDF of the observation data $X$ given that the signal is present, and $p(X \mid H_0)$ is the PDF of the
data $X$ given that the signal is not present. The likelihood ratio is optimal under the assumption that the PDFs
accurately reflect the true densities. For classification, we can apply a threshold value, $\tau$, to the likelihood ratio to
produce a binary decision, $u$, about the presence of the signal.

$$
u =
\begin{cases}
1 & \text{if } \lambda > \tau \\
0 & \text{if } \lambda < \tau
\end{cases}
\tag{3}
$$
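As a rough illustration of the single-feature step, the sketch below builds histogram density estimates for one feature, forms the empirical likelihood ratio, and thresholds it. The bin edges, the add-one smoothing, the tie handling, and the toy feature values are all assumptions made for this example; the study does not state how its histograms were binned or smoothed.

```python
from bisect import bisect_right


def bin_index(x, edges):
    """Index of the histogram bin containing x, clamped to the outer bins."""
    return min(max(bisect_right(edges, x) - 1, 0), len(edges) - 2)


def histogram_pdf(values, edges):
    """Histogram estimate of a one-dimensional PDF over the given bin edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        counts[bin_index(v, edges)] += 1
    # Add-one smoothing (an assumption, not from the source) keeps empty bins
    # from producing a zero density and a division by zero in the ratio.
    total = len(values) + len(counts)
    return [(c + 1) / (total * (edges[k + 1] - edges[k]))
            for k, c in enumerate(counts)]


def likelihood_ratio(x, edges, pdf_malignant, pdf_benign):
    """Empirical likelihood ratio p(x | H1) / p(x | H0) for one feature value."""
    k = bin_index(x, edges)
    return pdf_malignant[k] / pdf_benign[k]


def local_decision(x, edges, pdf_malignant, pdf_benign, tau):
    """Binary decision u: 1 if the ratio exceeds tau, else 0 (ties go to 0 here)."""
    return 1 if likelihood_ratio(x, edges, pdf_malignant, pdf_benign) > tau else 0


# Toy feature values, invented for illustration only.
edges = [0.0, 5.0, 10.0]
pdf_b = histogram_pdf([1.0, 2.0, 2.0, 3.0], edges)   # benign cases
pdf_m = histogram_pdf([6.0, 7.0, 7.0, 8.0], edges)   # malignant cases
u = local_decision(7.0, edges, pdf_m, pdf_b, tau=1.0)
```

Because each classifier sees only one feature, only one-dimensional histograms are ever estimated, which is the property the decision-fusion approach relies on for small samples.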
Since we assigned a separate likelihood ratio classifier to each of p features, we applied a separate threshold to each
classifier’s  output  value  to  produce  p  binary  decisions.  A  genetic  algorithm  searched  over  the  joint  set  of  thresholds 
in  order  to  maximize  the  classification  performance  of  the  fused  binary  decisions.  The  genetic  algorithm  search  time 
was  capped  at  30  generations  for  this  study  due  to  computational  cost. 
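The threshold search can be pictured with a very small genetic algorithm. The sketch below is only one possible arrangement: the source gives the 30-generation cap but not the population size, selection scheme, crossover, or mutation settings, so `pop_size`, `bounds`, `mutation_sd`, truncation selection, uniform crossover, and elitism are all assumptions. The `fitness` argument is a hypothetical callable that would, in a setting like this study's, score the fused decisions (for example by AUC or pAUC) for a given vector of p thresholds.

```python
import random


def genetic_search(fitness, n_genes, bounds=(0.0, 10.0), pop_size=20,
                   generations=30, mutation_sd=0.5, seed=0):
    """Maximize `fitness` over threshold vectors with a simple elitist GA."""
    rng = random.Random(seed)
    lo, hi = bounds
    population = [[rng.uniform(lo, hi) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    history = []                                  # best fitness per generation
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[:pop_size // 2]          # truncation selection
        children = [ranked[0][:]]                 # elitism: keep the best unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover, then Gaussian mutation clamped to the bounds.
            child = [rng.choice(pair) for pair in zip(a, b)]
            child = [min(hi, max(lo, g + rng.gauss(0.0, mutation_sd)))
                     for g in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return best, history
```

With elitism the best fitness cannot decrease from one generation to the next, so a curve of `history` that is still rising at the final generation suggests the search was stopped before it converged.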

Decision-fusion  theory  describes  how  to  combine  local  binary  decisions  optimally  to  determine  the  presence  or 
absence  of  a  signal  in  noise"'*^.  The  decision  fuser  optimally  fuses  all  the  local  decisions  according  to  the  operating 
points  on  the  receiver  operating  characteristic  (ROC)  curve  at  which  the  local  decisions  were  made.  Assuming 
statistically independent decisions, the likelihood ratio of the fused classifier is a product over the “yes, signal
present” ($u_i = 1$) decisions multiplied by a similar product over the “no, signal absent” ($u_i = 0$) decisions:

$$
\lambda_{\text{fused}}(u_1, \ldots, u_p) = \prod_{i:\,u_i = 1} \frac{Pd_i}{Pf_i} \prod_{i:\,u_i = 0} \frac{1 - Pd_i}{1 - Pf_i}
\tag{4}
$$

where $Pd_i$ is the probability of detection, or sensitivity, and $Pf_i$ is the probability of false detection, or (1 − specificity),
for the $i$-th local decision. The ROC curve can be computed from the unique likelihood-ratio values of the fused
classifier as shown in Equation (5).
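To make the fusion step concrete, the following sketch computes the fused likelihood ratio in log form. It assumes the standard rule for independent local decisions: each "yes" decision contributes a factor of sensitivity divided by false-detection probability, and each "no" decision contributes (1 − sensitivity) divided by (1 − false-detection probability). The function name and the example operating points are invented for illustration.

```python
import math


def fused_log_likelihood_ratio(decisions, pd, pf):
    """Log of the fused likelihood ratio for independent binary decisions.

    decisions: local decisions u_i (0 or 1)
    pd:        local sensitivities Pd_i, each strictly between 0 and 1
    pf:        local false-detection probabilities Pf_i, each strictly between 0 and 1
    """
    total = 0.0
    for u, d, f in zip(decisions, pd, pf):
        if u == 1:
            total += math.log(d / f)                   # "yes, signal present"
        else:
            total += math.log((1.0 - d) / (1.0 - f))   # "no, signal absent"
    return total


# Two local classifiers: the first votes "yes", the second votes "no".
log_lr = fused_log_likelihood_ratio([1, 0], pd=[0.8, 0.6], pf=[0.2, 0.3])
fused_lr = math.exp(log_lr)   # (0.8 / 0.2) * (0.4 / 0.7)
```

Summing logarithms instead of multiplying ratios is a numerical convenience when there are many features; it does not change the ordering of cases, so the resulting ROC curve is the same.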
2.2 Linear Discriminant Analysis

The baseline classifier was linear discriminant analysis (LDA), which served as a benchmark for the linear
separability  of  the  data  set. 
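For readers unfamiliar with the baseline, a two-class Fisher discriminant can be sketched in a few lines for the special case of two features. This is a teaching example only: the study's LDA used many more features, and the midpoint decision threshold (which implies equal priors and costs) is an assumption of this sketch.

```python
def fisher_lda_2d(class0, class1):
    """Fisher discriminant for two classes of 2-D points.

    Returns (w, c); a point x is assigned to class 1 when w[0]*x[0] + w[1]*x[1] > c.
    """
    def mean(rows):
        n = len(rows)
        return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

    def scatter(rows, m):
        sxx = sum((r[0] - m[0]) ** 2 for r in rows)
        sxy = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows)
        syy = sum((r[1] - m[1]) ** 2 for r in rows)
        return sxx, sxy, syy

    m0, m1 = mean(class0), mean(class1)
    a0, b0, d0 = scatter(class0, m0)
    a1, b1, d1 = scatter(class1, m1)
    a, b, d = a0 + a1, b0 + b1, d0 + d1      # pooled within-class scatter [[a, b], [b, d]]
    det = a * d - b * b                      # must be nonzero for the inverse to exist
    dm = [m1[0] - m0[0], m1[1] - m0[1]]
    # w = Sw^{-1} (m1 - m0), using the closed-form inverse of a 2x2 matrix.
    w = [(d * dm[0] - b * dm[1]) / det, (-b * dm[0] + a * dm[1]) / det]
    mid = [(m0[0] + m1[0]) / 2.0, (m0[1] + m1[1]) / 2.0]
    c = w[0] * mid[0] + w[1] * mid[1]        # threshold at the midpoint of the class means
    return w, c


# Toy points, invented for illustration only.
benign = [(0, 0), (1, 0), (0, 1), (1, 1)]
malignant = [(3, 3), (4, 3), (3, 4), (4, 4)]
w, c = fisher_lda_2d(benign, malignant)
```

Unlike the per-feature likelihood ratios, this classifier must estimate a joint scatter matrix across all features, which is one reason it tends to need more cases as the number of features grows.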

2.3  Sampling  and  Validation 

In  order  to  study  the  effect  of  sample  size  on  the  classifiers’  performances,  we  randomly  selected  subsets  of  the  data 
sets.  We  varied  the  number  of  selected  cases  from  50  to  500,  which  covers  typical  data  set  sizes  in  preliminary 
CADx research. Ten random subsets of each size were drawn to assess selection effects. On each subset,
both classifiers were trained and validated using 10-fold cross-validation. For each sample size, such as 100 cases,
classifiers  were  developed  using  ten  bootstrap  samples  of  that  number  of  cases,  which  allowed  the  calculation  of  the 
mean  AUC  and  pAUC  values  along  with  their  standard  deviations. 
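The sampling protocol can be outlined as follows. This is a sketch under stated assumptions: the step of 50 between subset sizes and drawing subsets without replacement are guesses (the text says only "50 to 500" and uses both "subsets" and "bootstrap samples"), and `evaluate` is a hypothetical callable that would train and cross-validate a classifier on the chosen cases and return one number such as AUC.

```python
import random
import statistics


def k_fold_indices(indices, k=10):
    """Split a list of case indices into k folds; yield (train, test) pairs."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def sample_size_curve(n_cases, evaluate, sizes=range(50, 501, 50),
                      draws=10, seed=0):
    """Mean and standard deviation of a performance metric for each subset size."""
    rng = random.Random(seed)
    curve = {}
    for size in sizes:
        scores = []
        for _ in range(draws):
            # Random subset of the full data set (without replacement: an assumption).
            subset = rng.sample(range(n_cases), size)
            scores.append(evaluate(subset))
        curve[size] = (statistics.mean(scores), statistics.stdev(scores))
    return curve
```

A call such as `sample_size_curve(1508, evaluate)` would return one (mean, standard deviation) pair per subset size, which is the form of the values plotted against sample size in the results.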

2.4  Classifier  Comparison 

Each classifier was evaluated using ROC analysis. Two clinically interesting summary metrics of the ROC curve
were used: the area under the curve (AUC) and the normalized partial area under the curve (pAUC), which is measured
above a sensitivity of Pd = 0.9.
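Both summary metrics can be computed directly from an empirical ROC curve, as in the sketch below. The normalization is an assumption: the area between the ROC curve and the line Pd = 0.9 is divided by (1 − 0.9), a common convention under which a perfect classifier scores 1 and a chance-level classifier scores 0.05. The source does not spell out its normalization, and the function names are invented for this example.

```python
def roc_points(scores, labels):
    """Return ROC points (Pf, Pd) from scores and 0/1 labels (1 = malignant)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points, tp, fp, i = [(0.0, 0.0)], 0, 0, 0
    while i < len(ranked):
        s = ranked[i][0]
        # Move the threshold past all cases that share this score.
        while i < len(ranked) and ranked[i][0] == s:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Trapezoidal area under the full ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def normalized_pauc(points, pd_min=0.9):
    """Area between the ROC curve and the line Pd = pd_min, divided by (1 - pd_min)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y1 <= pd_min:
            continue                      # segment lies entirely below the cutoff
        if y0 < pd_min:
            # Segment crosses the cutoff: keep only the part above it.
            x0 += (pd_min - y0) * (x1 - x0) / (y1 - y0)
            y0 = pd_min
        area += (x1 - x0) * ((y0 - pd_min) + (y1 - pd_min)) / 2.0
    return area / (1.0 - pd_min)


# Toy scores, invented for illustration: two malignant and two benign cases.
pts = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

The partial area weights only the high-sensitivity part of the curve, which is the region of clinical interest when missed cancers are far more costly than unnecessary biopsies.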


III.  RESULTS 


Figure 1 plots the classification performance against the number of cases. The classifiers’ performances were scored
both by ROC AUC (Fig. 1a and 1c) and pAUC (Fig. 1b and 1d).

On the calcification data (Fig. 1a and 1b), decision fusion achieved a maximum of AUC = 0.80±0.04 at 50 samples
and pAUC = 0.34±0.05 at 100 samples. The LDA had lower performance, with AUC = 0.68±0.04 and pAUC =
0.12±0.05 at 450 samples. The LDA had the expected testing trend of slowly increasing performance with
increasing sample size, but decision fusion showed the opposite trend. Perhaps because it was inadequately trained,
decision fusion’s performance decreased with sample size in both AUC and pAUC. Note that all of these are validation
results from k-fold cross-validation, which normally should minimize effects of training bias.

For the mass lesion data (Fig. 1c and 1d), the two classifiers’ performances had more similar trends. Decision fusion
reached a maximum of AUC = 0.92±0.02 and pAUC = 0.48±0.02 at 50 samples, and the LDA reached AUC =
0.92±0.03 and pAUC = 0.55±0.04 at 500 samples. No significant performance differences between the classifiers
were seen in sample sizes greater than 100. For very small data sets of 50 cases, decision fusion outperformed the
LDA. In both data sets, decision fusion approached its final AUC value with many fewer cases than the LDA
required. All plots except Fig. 1b showed that decision fusion had a smaller slope than the LDA.


Figure 1: Classifier performance vs. sample size. (a) Cross-validation AUC vs. sample size, calcification data. (b) Cross-validation pAUC vs. sample size, calcification data.


Decision  fusion  significantly  outperformed  the  LDA  on  the  calcification  data  set.  The  performance  difference  was 
greatest  for  small  data  sets.  However,  on  the  larger  data  sets,  the  performance  gap  narrowed  to  0.06.  In  part  (b), 
decision fusion achieved pAUC = 0.34±0.05 at 100 samples and then fell to pAUC = 0.2±0.02 at 500 samples.
Although  the  two  classifiers  had  very  similar  performance  on  the  mass  data  set,  decision  fusion  still  outperformed  the 
LDA  for  very  small  sample  sizes. 


IV.  DISCUSSION 

Decision  fusion  had  its  biggest  classification  performance  gain  over  the  LDA  on  the  noisier,  more  nonlinear  data  set, 
the  calcification  data  set.  On  the  mass  data  set,  both  the  LDA  and  decision  fusion  performed  very  similarly  for  data 
sets  larger  than  50  samples.  On  very  small  data  sets  of  50  samples,  which  are  common  among  initial  CADx  studies, 
decision  fusion  outperformed  the  LDA.  For  the  mass  data  set  at  least,  a  particular  strength  of  the  decision-fusion 
algorithm  is  that  it  is  able  to  estimate  asymptotic  testing  performance  with  many  fewer  cases  than  other  classifiers 
require.  Figure  1  shows  that  decision  fusion  was  able  to  achieve  approximately  the  same  testing  performance  with 
50  cases  as  with  500  cases. 

The  general  downward  slope  of  the  decision  fusion  curves  for  the  calcification  data  set  may  be  due  to  inadequate 
training.  For  computational  convenience,  we  limited  the  genetic  algorithm’s  search  time  to  only  30  generations. 
Whereas 30 generations were adequate for data sets smaller than 150 cases, larger data sets required more
genetic algorithm generations for complete optimization. A much longer run of 3000 generations on all available
1508 cases in the calcification lesion data set improved decision fusion’s performance under 100-fold
cross-validation to AUC = 0.85±0.01 and pAUC = 0.28±0.03, which exceeded the performance for all data points shown
in Fig. 1a and 1b. A similar, more thorough optimization on all available 803 cases in the mass data set allowed
decision fusion to reach AUC = 0.94±0.01 and pAUC = 0.63±0.07, which likewise exceeded the performances
in Fig. 1c and 1d.

The  improvements  were  usually  significant  for  the  more  challenging  calcification  data  set,  but  not  for  the  mass  data 
set.  Such  a  statement  may  not  reflect  the  full  diversity  of  these  data  sets,  which  differ  in  many  respects,  including 
linear  separability,  numbers  of  cases,  numbers  and  types  of  features,  and  feature  correlations.  Future  work  will 
explore  the  contribution  of  such  factors  using  controlled  simulation  data  sets  in  order  to  understand  the  full  potential 
and  limitations  of  the  decision-fusion  technique. 